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1  Statement  of  Work 

Short-  and  Long-term  Effects  in  Prostate  Cancer  Survival:  Analysis  of  Treatment  Efficacy  and 

Risk  Prediction 

Alexander  Tsodikov,  Ph.D. 

In  the  course  of  the  project  there  has  been  no  change  in  the  scope  of  work.  All  tasks  have  been 

completed.  A  breakdown  below  shows  what  has  been  accomplished. 

Task  1.  Develop  model-building  techniques 

Task  2.  Develop  estimation  and  hypothesis  testing 

Task  3.  Develop  variable  selection  procedures 

Task  4.  Analysis  of  the  data  for  significant  effects 

Task  5.  Computer-intensive  approaches  to  prognosis  and  validation 

2  Objectives 

There  has  been  no  change  in  the  project  objectives.  The  specific  aims  of  this  project  are 

1.  To  provide  a  statistical  model  that  reproduces  the  complex  survival  responses  in  prostate  cancer. 

2.  To  develop  methodology  for  analysis  of  prognosis  after  treatment  for  prostate  cancer  taking  into 
account  the  long-  and  short-term  effects  of  prognostic  factors  and  treatment. 

3.  To  develop  statistical  software  implementing  model-building,  estimation,  construction  of  prog¬ 
nostic  indices,  conditional  survival  prognosis,  and  assessment  of  the  quality  of  prognostic  clas¬ 
sifications  based  on  the  new  models. 

4.  To  apply  the  models  and  methodology  to  analyze  post-treatment  survival  of  patients  with 
prostate  cancer  using  data  from  the  Memorial  Sloan  Kettering  Cancer  Center  and  the  SEER 
database. 


3  Introduction 

This  project  represents  a  successful  effort  to  develop  abstract  statistical  theory,  computational 
algorithms,  translate  this  methodology  into  stable  shareware  software  products  that  can  be  used 
by  the  broad  scientific  community,  put  this  product  into  the  R-Projects  nltm  and  rpNLTM  that 
has  become  the  dominant  site  for  dissemination  of  cutting  edge  statistical  procedures,  and  finally 
use  and  showcase  all  this  arsenal  to  address  real  data  and  problems  in  prostate  cancer.  We  are 
glad  that  we  took  the  challenge  of  this  large  idea  development  and  translational  effort  in  the  three 
year  project  performance  period  and  that  were  able  to  see  it  through. 

The  goal  of  this  proposal  was  to  investigate  a  novel  approach  to  the  analysis  of  post-treatment 
survival  of  prostate  cancer  patients:  the  decomposition  of  the  diversity  of  survival  patterns  into 
short-term  and  long-term  effects.  We  proposed  to  identify  a  model  of  prostate  cancer  survival 
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incorporating  long-  and  short-term  effects  of  prognostic  factors  and  treatment.  Novel  statistical 
tools  were  developed  to  make  such  models  work  for  better  prognosis  of  prostate  cancer  patients. 
Year  1  at  the  University  of  Utah  was  primarily  devoted  to  development  of  methodology  for  point 
estimation  and  hypothesis  testing.  While  continuing  methodological  research  in  Year  2,  we  focused 
on  the  delivery  aspect  of  the  progect  addressing  software  development  and  implementation  of  the 
algorithms,  testing  them  by  simulations,  development  of  tools  for  multivariate  analysis  and  variable 
selection  and  preliminary  applications  of  these  tools  to  real  data.  In  the  last  3rd  year  of  the  grant 
(a  no-cost  extention),  we  focused  on  justifications  for  asymptotic  theory,  development  of  computer¬ 
intensive  data  mining  tools  using  our  models,  developing  an  R-package  that  would  give  a  broad 
scientific  community  access  to  the  free  software  tools  implementing  the  methodology  developed  in 
this  project,  and  on  applications  of  these  methods  to  analyze  data  on  survival  of  patients  treated 
for  prostate  cancer  from  Sloan-Kettering  Cancer  Center  Database  and  SEER  database. 

In  the  following  sections  of  the  report  we  give  a  summary  of  the  results  achieved  in  this  project. 


4  Methodology 

4.1  Models  and  inference  procedures 

Motivated  by  second-order  properties  of  frailty  models  we  have  proposed  a  family  of  so-called 
Nonlinear  Transformation  Models  (NTM)  (Year  2  report,  Section  4).  The  models  were  supplied 
with  a  general  numerical  inference  framework  based  on  the  QEM  algorithm  (Year  1  report,  Section 
4).  We  developed  composition  techniques  that  allowed  us  to  easily  extend  NTMs  in  a  hierarchical 
fashion  into  more  complex  models  (Year  1  report,  Section  5,  Year  2  report  Section  7).  We  proved 
that  the  QEM  estimation  algorithm  will  fit  any  model  constructed  using  the  techniques.  This 
framework  was  used  to  come  up  with  a  flexible  family  of  models  that  incorporate  long-  and  short¬ 
term  survival  effects.  We  have  extensively  studies  the  properties  of  this  estimation  procedure  by 
simulations  (Year  2  report  Section  8).  In  order  for  the  models  to  be  useful  for  the  analysis  of 
prostate  cancer,  we  developed  a  hypothesis  testing  framework.  We  started  with  the  traditional 
likelihood  ratio  test  and  variable  selection  procedures  (Year  1  report,  Section  6,  Year  2  report 
Section  5),  and  then  developed  sophisticated  techniques  that  allowed  us  to  obtain  exact  observed 
profile  information  matrix  for  the  models  (Year  2  report  Section  5).  Coupled  with  the  Wald  test, 
this  method  made  complex  hypothesis  testing  computationally  tractable.  Subsequent  sections  of 
this  report  describe  development  of  computer-intensive  model  selection  and  hypothesis  testing 
procedures  and  application  of  the  whole  machinery  to  two  large  datasets  on  survival  of  prostate 
cancer  patients. 

4.2  Computer-intensive  model  selection 

This  section  describes  the  work  performed  in  Year  3. 

Traditional  forward  and  backward  variable  selection  procedures  as  well  as  two  computer  in¬ 
tensive  extended  procedures  (a  Tree-Based  procedure  for  the  forward  search  and  a  Backwards 
Pooling  procedure  for  the  backwards  search)  were  developed  for  all  the  models  incorporated  in  the 
software. 

Variable  selection  is  organized  by  cycles  of  elementary  hypotheses  testing  followed  by  a  selection 
of  the  best  model  in  the  current  cycle.  The  cycling  is  repeated  until  neither  of  the  potential  models 
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evaluated  at  the  current  cycle  satisfies  a  certain  criteria  for  a  continued  search. 

Technically,  in  our  implementation  of  variable  selection  procedures,  we  always  work  with 
the  maximal  model  containing  all  model  parameters  representing  regression  coefficients  f3  = 
(Po,  Pi,  ■  ■  ■ ,  Pk)-  A  non-cure  model  typically  will  not  have  an  intercept  term  /30-  Forward  vari¬ 
able  selection  procedures  put  maximal  restrictions  on  the  model  parameters  and  work  forward 
by  removing  those  restrictions  sequentially  until  either  an  unrestricted  model  emerges  or  a  cer¬ 
tain  criterion  is  met.  With  the  backwards  selection  procedures,  an  unrestricted  maximal  model 
is  the  starting  point.  The  procedure  then  adds  restrictions  sequentially  until  either  a  maximally 
restricted  model  is  achieved,  or  a  certain  stopping  criterion  is  met. 

4.2.1  Traditional  variable  selection  procedures 

With  the  forward  selection  procedure,  a  model  where  all  model  parameters  representing  regression 
coefficients  (3  =  (ft  i,  ■  ■  ■  ,  Pk)  are  fixed  at  zero.  This  corresponds  to  the  hypothesis  of  homogeneity. 
The  cycle  of  elementary  hypothesis  testing  consists  of  evaluating  a  likelihood  ratio  characterizing 
the  improvement  in  the  goodness  of  fit  resulting  from  removing  a  restriction  for  one  of  the  regression 
coefficients  f3t  =  0.  Model  with  the  smallest  p-value  for  testing  the  hypothesis  represented  by  the 
restriction  is  chosen  as  the  next  base  model,  i.e.  a  model  on  which  the  next  cycle  will  be  based. 
The  stopping  occurs  if  all  such  p-values  are  larger  than  a  certain  threshold. 

Forward  procedures  are  subject  to  criticism  that  until  the  best  model  is  achieved,  hypothesis 
testing  is  based  on  a  misspecified  model,  and  therefore  the  validity  of  p-values  may  be  a  suspect. 
This  consideration  brings  us  to  the  forward  procedures.  Speed  is  an  advantage  of  forward  selection 
procedures  as  only  models  with  degrees  of  freedom  less  than  that  of  the  best  model  are  evaluated. 
Speed  can  further  be  improved  by  using  a  score  test  instead  of  likelihood  ratio  that  requires  fitting 
each  potential  model  being  evaluated  (currently  not  implemented). 

In  the  forward  selection  procedure  an  unrestricted  maximal  model  is  the  starting  point.  Re¬ 
strictions  of  the  type  H0  :  (it  —  0  are  evaluated  at  each  point  as  the  procedure  cycles  through  all 
potential  models  in  the  current  cycle  where  i  goes  through  all  yet  unrestricted  parameters.  Model 
showing  the  largest  p-value  exceeding  a  certain  threshold  is  selected  as  the  base  model  for  the 
next  cycle.  The  procedure  is  stopped  when  non  of  such  p-values  exceed  the  threshold  or  when  all 
parameters  have  been  restricted.  By  the  nature  of  this  procedure,  relatively  big  models  with  a 
sizable  fraction  of  non-significant  parameters  are  evaluated  until  the  best  one  is  achieved.  Speed 
of  the  likelihood  ratio  test  that  requires  fitting  each  potential  model  is  a  challenge.  In  order  to 
improve  on  the  speed  of  search,  a  Wald  test  have  been  implemented  for  all  the  models.  The  Wald 
test  described  in  Section  ??  uses  our  novel  results  on  the  exact  profile  information  matrix  (Section 
??).  When  occasional  problems  with  information  matrix  inverses  occur  in  the  presence  of  many 
non-significant  parameters  (typically  with  highly  overparameterized  maximal  models  at  the  few 
initial  dimension  reduction  steps),  the  procedure  uses  likelihood  ratio  test  instead,  and  tries  to 
swich  back  to  the  Wald  test  immediately  after. 

4.2.2  Backwards  pooling  procedure 

In  the  backwards  pooling  procedure,  two  kinds  of  restrictions  are  considered: 

•  Fixing  ith  parameter  Ho  :  Pi  =  0,  and 

•  Pooling  ith  and  jth  parameter  H0  :  Pi  =  Pj,  i  j . 
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When  the  pooling  hypothesis  is  first  accepted,  a  cluster  of  two  pooled  parameters  is  created. 
Later  on,  any  free  parameter  or  a  different  cluster  of  pooled  parameters  could  be  merged  with  the 
first  one.  Note  that  by  the  nature  of  the  imposed  restrictions,  any  cluster  of  pooled  parameters 
contributes  one  degree  of  freedom  (one  parameter-representative)  to  the  model,  and  the  pooling 
restriction  forces  all  other  mmembers  of  the  cluster  to  be  equal  to  the  parameter-representative. 

Unlike  the  traditional  backwards  variable  selection  procedure,  its  Backwards  Pooling  general¬ 
ization  is  a  much  more  flexible  data  mining  tool.  Indeed,  the  number  of  potential  models  considered 
on  each  cycle  is  an  order  of  magnitude  larger  than  in  the  traditional  backwards  selection  procedure, 
and  is  equal  to  A(A— l)/2+A,  where  A  =  (#  of  free  parameters  +  #  of  pooled  parameter  clusters) 
To  understand  the  difference,  consider  a  model  with  single  predictor  f/jz  and  one  categorical  covari- 
ate  z.  If  simple  contrast  is  used,  z  will  be  represented  by  a  number  of  dummy  variables  comparing 
every  possible  category  of  z  to  a  selected  baseline  category.  Traditional  variable  selection  in  this 
situation  would  preserve  any  effect  showing  significant  difference  with  the  baseline.  At  the  same 
time  this  ’’‘best’”  model  may  still  be  imperfect  and  overparameterised  as  some  categories  may 
show  similar  effects,  and  even  though  they  might  be  significantly  different  from  the  baseline,  they 
may  show  no  difference  whatsoever  among  themselves.  The  Backwards  Pooling  procedure  will 
in  this  case  continue  variable  selection  until  all  differences  between  clusters  of  pooled  parameters 
are  significant.  With  z  representing  a  categorized  continuous  veriable,  for  example,  the  procedure 
can  be  used  to  select  optimal  outpoints  on  z  that  divide  the  sample  into  an  optimal  number  of 
groups  maximally  seperated  in  terms  of  risk  predicted  by  z.  With  factorial  parameterization  (the 
one  that  includes  all  possible  interactions)  of  a  set  of  categorical  covariates,  the  output  of  the 
Backwards  Pooling  procedure  is  conceptually  similar  to  a  pruned  regression  tree,  or  to  an  output 
of  a  clustering  algorithm,  where  groups  correponding  to  distinct  patterns  of  the  covariate  vector  z 
are  clustered  so  that  the  difference  within  each  cluster  in  minimized,  while  the  difference  between 
clusters  is  maximized. 

4.2.3  Tree-based  methods 

Traditional  regression  tree  methodology  [Breiman  et  ah,  1984]  is  based  on  recursive  partitioning 
of  the  data  using  splits  defined  by  outpoints  put  on  covariates.  The  optimal  cuntpoint  at  each 
step  of  the  procedure  typically  maximizes  a  two-sample  test  statistic  (minimizes  the  p- value).  In 
survival  analysis  a  logrank  test  or  any  other  traditional  rank  test  can  be  used  to  define  the  splits. 
We  recognize,  however,  that  logrank  test  is  a  score  test  for  the  Proportional  Hazards  model,  and 
is  sensitive  to  the  long-term  effect  in  the  presence  of  long-term  survivors.  When  long-  and  short¬ 
term  effects  are  counteractive,  logrank  test  can  be  dramatically  underpowered.  In  Year  2  report 
(Figure  6,  page  20;  also  see  [Wendland  et  ah,  2004])  we  discussed  a  breast  cancer  example  where 
due  to  counteracting  long-  and  short-term  effects  survival  curves  crossed,  and  the  logrank  test 
showed  a  p- value  of  0.79  while  the  test  of  homogeneity  and  separate  tests  for  long-  and  short-term 
effects  based  on  the  PHPH  model  showed  a  highly  significant  result.  Prostate  cancer  biochemical 
failure  data  from  MSKCC  also  show  crossing  curves  (Figure  ??).  We  therefore  believe  that  in 
the  presence  of  both  long-  and  short-term  effects,  and  particularly  with  crossing  survival  curves, 
PHPH  or  similar  cure  model  with  two  predictors  should  be  used  as  a  basis  for  deriving  the  two- 
sample  test.  This  will  insure  that  regression  tree  would  still  be  sensitive  to  differences  departing 
from  the  proportional  hazards  assumption. 

We  have  developed  a  general  tree-based  data  mining  procedure  where  splits  are  based  on  a 
two-sample  test  derived  from  any  available  Nonlinear  Transformation  Model  (NTM). 
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Figure  1:  Time  to  biochemical  failure.  MSKCC  data.  Dose  categories  are  0  (lowest)  through 
3  (highest).  Prognostic  category  is  composed  of  PSA  and  Gleason  score,  categorized  0  (best) 
through  2  (worst). 
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Given  a  data  set  and  an  NTM  the  procedure  does  the  following: 

1.  Recursively  splits  the  data. 

2.  Finds  the  AIC  of  each  node’s  model 

3.  Plots  the  tree  with  the  AIC  information 

Splitting  is  organized  according  to  the  following  algorithm. 

1.  For  a  continuous  covariate  x  with  distinct  values  xi,  x2,  ■  ■  ■ ,  Xk,  all  possible  splits  of  the  form 
x  <  Xi  are  considered.  This  is  done  via  an  indicator  dummy  variable  that  equals  1  if  x  <  Xi, 
and  0  otherwise.  Using  a  subset  of  the  data  corresponding  to  the  node  being  considered  for 
splitting,  an  NTM  is  fitted  with  the  indicator  covariate  as  the  only  variable  in  the  model.  A  one 
step  of  the  Powell  optimization  procedure  is  used  with  the  profile  likelihood  as  a  target  function. 
Profile  likelihood  corresponding  to  the  outcome  is  retained  as  a  criterion  for  the  goodness  of 
split. 

2.  For  a  discrete  covariate  x,  with  values  {ai, ..,  a*,},  all  possible  subsets  of  {ai, ...,  (ik }  are  consid¬ 
ered.  For  a  given  subset,  the  indicator  dummy  variable  takes  the  value  1  if  x  is  in  the  subset, 
and  0  otherwise.  Evaluation  of  each  such  model  is  done  as  in  the  case  of  a  continuous  covariate. 

3.  Among  all  splits  considered,  the  one  with  the  largest  profile  likelihood  is  kept.  This  produces  2 
descendant  nodes  where  the  procedure  is  repeated. 

4.  The  stopping  criterion  is  based  on  the  minimal  number  of  observations  allowed  in  a  node. 

AIC  is  calculated  using  a  full  model  fitting  approach  at  each  node.  At  a  given  node,  using  all 

the  data,  a  model  with  covariates  corresponding  to  the  node  in  question  and  all  its  parent  nodes 

is  fitted.  Node  with  the  smallest  AIC  corresponds  to  the  best  model. 

With  a  cure  NTM  with  two  predictors,  same  sets  of  covariates  are  used  for  long-  and  short 

term  effects. 


5  Computer  Software  and  tools 

This  section  describes  the  work  performed  in  Year  2-3. 

5.1  Delphi  Package 

The  first  version  of  the  Delphi  package  EMc  was  described  in  our  Year  2  report,  Section  9.  In  Year 
3  the  package  has  seen  many  improvements  and  extensions.  Key  additions  include  development  of 
the  Backward  Pooling  variable  selection  procedure  (Section  4.2.2)  and  incorporation  of  the  profile 
information  matrix  theory  (Section  ??)  into  hypothesis  testing,  confidence  intervals  and  variable 
selection  blocks.  These  additions  helped  reduce  the  computational  burden  of  hireling  the  best 
model  for  a  large  dataset  such  as  MSKCC  or  SEER  databases  to  the  point  that  such  analysis 
and  data  mining  became  feasible  (Section  6).  On  the  surface,  these  additions  are  seemless  -  when 
Backwards  Pooling  variable  selction  procedure  is  complete,  the  internal  structure  of  the  software 
is  populated  with  the  best  model.  Detailed  log  and  result  hies  are  created. 

The  Figures  3-13  represent  the  basic  fucntionality  of  the  package. 
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Figure  2:  Reading  data  into  the  Delphi  EMc  package. 
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Figure  3:  Browsing  for  a  data  file. 
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Figure  4:  Read  the  data  and  data  entry  log. 
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Figure  5:  Select  covariates  to  include  in  the  maximal  model  and  define  their  types. 


DAMD  7-03-1-0034 


Page  13 


Final  Report 


FINAL  PROGRESS  REPORT 


PI:  Tsodikov,  Alexander 


Figure  6:  Data  slicing.  Show  survival  curves  corresponding  to  distinct  patterns  of  categorical 
variables  (groups). 
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Figure  7:  Choose  a  model  to  be  fitted. 
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Figure  8:  Interim  output  of  the  likelihood  maximization  procedure. 
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Figure  9:  Superimposing  survival  curves  expected  under  the  model  on  observed  Kaplan-Meier 
estimates  for  distinct  patterns  of  categorical  variables. 
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.|*  (c)  Alex  Tsodikov  (2004)  Nonlinear  transformation  models:  Generalized  self-consistency  approach 
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Figure  10:  Performing  a  likelihood  ratio  test  to  compare  nested  models. 
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♦  |*  (c)  Alex  Tsodikov  (2004)  Nonlinear  transformation  models:  Generalized  self-consistency  approach 
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Figure  11: 


Point  estimates  and  confidence  intervals  for  fitted  model  parameters. 
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Figure  12:  Variable  selection  procedures  and  interim  output. 
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Figure  13:  Flexible  model  restrictions.  Manually  specifying  a  restricted  model  to  test  specific 
hypotheses. 
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Figure  14:  Link  to  nltm  package  on  the  R  project  web  page. 

5.2  R  package  nltm 

The  nltm  package  implements  the  basic  functionality  of  EMc  (Section  5.1)  in  the  R  environment 
without  the  Delphi  GUI.  Although  these  two  packages  are  implementations  of  the  same  basic 
methodology,  the  source  code  is  written  in  two  different  languages.  EMc  is  written  in  Delphi  7  by 
Inprise,  a  visual  and  object  oriented  version  of  the  Pascal  programming  language.  The  R  package 
nltm  is  written  in  c  and  R  with  the  subsequent  compilation  as  an  R  package.  The  source  codes 
and  a  brief  package  description  is  available  on  the  R  webpage  (Figure  14).  Compiled  versions  for 
Unix  and  Windows  will  be  produced  by  the  R  project  team  from  the  source  codes,  and  at  this 
point  the  package  will  be  directly  installable  from  the  user’s  R  environment.  In  the  meanwhile, 
both  packages  compiled  for  Windows  are  available  from  the  PI  for  a  manual  installation.  The 
following  models  are  currently  supported  by  nltm: 

•  Proportional  hazard  model  (PH): 

G(t\z)  =  F(t)e « 

•  Proportional  hazard  cure  model  (PHC): 

G(t\z)  =  exp(-0(z)(l  -  F(t))) 
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Proportional  odds  model  (PO): 


G(t|s)  - 

Proportional  hazard  -  proportional  hazard  cure  model  (PHPHC): 

G(t\z)  =  exp(-0(z)(l  -  F^z\t))) 

Proportional  hazard  -  proportional  odds  cure  model  (PHPOC): 

cm  =  exp  {-^)i_(|:^)F(t) 


Gamma  frailty  model  (GFM): 


G{t\z)  = 


6(z)v('z'>  1^ 

e(z)-ln(F(t))_ 


Proportional  hazard  -  proportional  odds  model  (PHPO): 


G(t\z)  = 


i  -  (i  -  e)Fr>(*)(t) 


The  following  R  command  represents  the  syntax  for  a  call  to  an  estimation  procedure 

nltm(f ormula=f ormula(data) ,  data=parent . frame () ,  subset,  na. action, 
init,  control,  model=c("PHM , "PHC" , "PD" , "PHPHC" , "PHPOC" , "GFM" , "PHPO") , 
verbose=FALSE,  . . . ) , 


where  the  arguments  have  the  following  meaning 

formula  A  formula  object,  with  the  response  on  the  left  of  a  operator,  and  the  terms  on  the  right. 
The  response  must  be  a  survival  object  as  returned  by  the  Surv  function. 

data  A  data. frame  in  which  to  interpret  the  variables  named  in  the  formula,  or  in  the  subset 
argument. 

subset  Expression  saying  that  only  a  subset  of  the  rows  of  the  data  should  be  used  in  the  fit. 

na. action  A  missing-data  filter  function,  applied  to  the  model. frame,  after  any  subset  argument  has  been 
used.  Default  is  options () $na. action. 

init  Vector  of  initial  values  for  the  calculation  of  the  maximum  likelihood  estimator  of  the  regression 
parameters.  Default  initial  value  is  zero. 

control  Object  of  class  coxph .  control  specifying  iteration  limit  and  other  control  options.  Default  is 
nltm . control (...). 

model  A  character  string  specifying  a  non-linear  transformation  model.  Default  Proportional  Hazards 
Model. 


verbose  If  TRUE  it  stores  information  from  maximization  of  likelihood  and  calculation  of  information 
matrix  in  a  file.  Default  is  FALSE. 


DAMD  7-03-1-0034 


Page  23 


Final  Report 


FINAL  PROGRESS  REPORT 


PI:  Tsodikov,  Alexander 


...  Other  arguments 

The  procedure  returns  a  value  (object)  of  the  class  "  'coxph"  ’ .  Thus,  in  its  usage  the  package 
is  similar  to  existing  R  package  survival. 

The  following  is  an  example  of  usage  in  R. 

1.  Create  a  simple  test  data  set  with  four  variables,  time  (survival  time),  status  (censoring  index), 
and  two  covariates,  age  (categorized  age  in  years  represented  by  the  right  point  of  the  bin),  and 
size  (tumor  size  in  mm). 

testl  <-  list (time=c (10 ,7 , 32,23,22,6,16,34,32,25,11,20,19,6,17,35,6,13,9,6,1), 
status=c(l ,  1,  0,  1,  1,  1,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  1,  0,  0,  0), 
size=c (1.79, 7. 93, 2. 02, 6. 89, 2. 30, 7. 82, 1.25, 9. 85, 6. 02, 3. 43, 4. 72, 7. 45, 8. 83, 
9.53,1.10,1.06,5.25,5.86,2.03,3.62,3.52), 
age=f actor (c (65 ,65,65,65,99,45,65,99,99,99,65,45,65,55,45,45,55,55,55,99,65))) 

2.  Call  ntml  procedure  to  fit  the  PO  model 

nltm(Surv(time , status)  ~  size  +  age,  data=testl,  model="P0") 

results  in  a  table  of  point  estimates,  standard  errors  and  p-values  for  dropping  the  term. 

Call: 

nltm( formula  =  Surv(time,  status)  ~  size  +  age,  data  =  testl, 
model  =  "PHPHC" ) 


coef 

exp(coef ) 

se (coef ) 

z 

P 

size 

0.141 

1 . 15e+00 

0.174 

0.812 

0.420 

age55 

5.000 

1 . 48e+02 

9.763 

0.512 

0.610 

age65 

1.736 

5 . 67e+00 

1.219 

1.424 

0.150 

age99 

-0.264 

7.68e-01 

1.450 

-0.182 

0.860 

size 

-0.117 

8.90e-01 

0.234 

-0.500 

0.620 

age55 

-4.975 

6.91e-03 

9.815 

-0.507 

0.610 

age65 

-3.353 

3.50e-02 

1.796 

-1.866 

0.062 

age99 

-3.358 

3.48e-02 

2.251 

-1.492 

0.140 

cure 

-1.920 

1.47e-01 

1.368 

-1.403 

0.160 

Likelihood  ratio  test=10.3  on  9  df ,  p=0.325  n=  21 

3.  A  similar  call  to  the  PHPH  cure  model  with  long-  and  short-term  predictors  results  in  a  table 
twice  the  size  of  the  one  for  a  PO  model,  where  the  first  set  of  coefficients  corresponds  to  long¬ 
term  effects  of  the  covariate,  and  the  second  set  of  coefficients  corresponds  to  short-term  effects 
of  the  same  covariates. 
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nltm(Surv(time , status)  ~  size  +  age,  data=testl,  model="PHPHC") 
Call: 

nltm(f ormula  =  Surv(tirae,  status)  ~  size  +  age,  data  =  testl, 
model  =  "PHPHC" ) 


coef 

exp(coef ) 

se (coef ) 

z 

P 

size 

0.141 

1 . 15e+00 

0.174 

0.812 

0.420 

age55 

5.000 

1 . 48e+02 

9.763 

0.512 

0.610 

age65 

1.736 

5 . 67e+00 

1.219 

1.424 

0.150 

age99 

-0.264 

7.68e-01 

1.450 

-0.182 

0.860 

size 

-0.117 

8.90e-01 

0.234 

-0.500 

0.620 

age55 

-4.975 

6.91e-03 

9.815 

-0.507 

0.610 

age65 

-3.353 

3.50e-02 

1.796 

-1.866 

0.062 

age99 

-3.358 

3.48e-02 

2.251 

-1.492 

0.140 

cure 

-1.920 

1.47e-01 

1.368 

-1.403 

0.160 

Likelihood  ratio  test=10.3  on  9  df ,  p=0.325  n=  21 


5.3  R  package  rpNLTM 

The  R  package  rpNLTM  implements  Recursive  Partitioning  and  Regression  Trees  algorithms  using 
splits  with  criteria  supplied  by  fitting  an  NTM  model  using  nltm  package  functions. 

The  following  R  command  represents  the  syntax  for  a  call  to  an  estimation  procedure 

rpNLTM (formula,  data,  weights,  subset,  na.action=na.rpart,  method, 

model=FALSE,  x=FALSE,  y=TRUE,  parms,  control,  cost,  control .nltm, 

model . nltm=c ("PH" , "PHC" , "PO" , "PHPHC" , "PHPOC" , "GFM" , "PHPO" ) ,  verbose=FALSE,  . . .) 


where  the  arguments  have  the  following  meaning 


formula 

data 


subset 
na. action 


A  formula,  as  in  the  lm  function. 

An  optional  data  frame  in  which  to  interpret  the  variables  named  in  the  formula  weights  optional 
case  weights. 

Optional  expression  saying  that  only  a  subset  of  the  rows  of  the  data  should  be  used  in  the  fit. 

The  default  action  deletes  all  observations  for  which  y  is  missing,  but  keeps  those  in  which  one 
or  more  predictors  are  missing. 


method  One  of  ”anova’,  '"poisson”,  ’’class”,  ”exp”  or  '’nltm'".  ”nltm”  method  is  of  primary  interest 
as  it  triggers  regression  trees  procedures  based  on  the  methodology  developed  in  this  project. 
The  other  methods  are  inherited  from  the  existing  package  ”rpart”  and  are  not  related  to  this 
project. 
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model  If  logical:  keep  a  copy  of  the  model  frame  in  the  result?  If  the  input  value  for  model  is  a  model 
frame  (likely  from  an  earlier  call  to  the  rpart  function),  then  this  frame  is  used  rather  than 
constructing  new  data. 

x  Keep  a  copy  of  the  x  matrix  in  the  result. 

y  Keep  a  copy  of  the  dependent  variable  in  the  result.  If  missing  and  model  is  supplied  this 
defaults  to  FALSE. 

parms  Optional  parameters  for  the  splitting  function.  Anova  splitting  has  no  parameters.  Poisson 
splitting  has  a  single  parameter,  the  coefficient  of  variation  of  the  prior  distribution  on  the 
rates.  The  default  value  is  1.  Exponential  splitting  has  the  same  parameter  as  Poisson.  For 
classification  splitting,  the  list  can  contain  any  of:  the  vector  of  prior  probabilities  (component 
prior),  the  loss  matrix  (component  loss)  or  the  splitting  index  (component  split).  The  priors 
must  be  positive  and  sum  to  1.  The  loss  matrix  must  have  zeros  on  the  diagnoal  and  positive 
off-diagonal  elements.  The  splitting  index  can  be  gini  or  information.  The  default  priors  are 
proportional  to  the  data  counts,  the  losses  default  to  1,  and  the  split  defaults  to  gini. 

control  Options  that  control  details  of  the  rpart  algorithm. 

cost  A  vector  of  non-negative  costs,  one  for  each  variable  in  the  model.  Defaults  to  one  for  all 
variables.  These  are  scalings  to  be  applied  when  considering  splits,  so  the  improvement  on 
splitting  on  a  variable  is  divided  by  its  cost  in  deciding  which  split  to  choose. 

control  .nltm  Object  of  class  coxph.  control  specifying  iteration  limit  and  other  control  options.  Default  is 
nltra . control (...). 

model. nltm  A  character  string  specifying  a  non-linear  transformation  model.  Default  Proportional  Hazards 
Model. 

verbose  If  TRUK  it  stores  information  from  maximization  of  likelihood  and  calculation  of  information 
matrix  in  a  hie.  Default  is  FALSE. 

.  .  .  arguments  to  rpart .  control  may  also  be  specified  in  the  call  to  rpart.  They  are  checked 
against  the  list  of  valid  arguments. 

The  procedure  returns  an  object  of  class  rpart,  a  superset  of  class  tree. 

The  following  is  an  example  of  usage. 

1.  Introduce  a  dataset. 


leukl  <-  list (time=c (10 ,7, 32 ,23,22,6,16,34,32,25,11,20,19,6,17,35,6,13,9,6,10), 
status=c(l ,  1,  0,  1,  1,  1,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  1,  0,  0,  0), 
stage=factor(c(2,  1,  0,  0,  1,  2,  1,  1,  2,  0,  0,  1,  0,  0,  2,  0,  2,  1,  1,  0,  0)), 
size=c (1.79, 7. 93, 2. 02, 6. 89, 2. 30, 7. 82, 1.25, 9. 85, 6. 02, 3. 43, 4. 72, 7. 45, 8. 83, 
9.53,1.10,1.06,5.25,5.86,2.03,3.62,3.52), 

age=f actor (c (65 ,65,65,65,99,45,65,99,99,99,65,45,65,55,45,45,55,55,55,99,65))) 


2.  Generate  a  regression  tree  based  on  the  PO  model. 
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fit  <-  rpNLTM(f ormula=Surv(time , status)  ~  size  +  age,  data=leukl, 
method="nltm" ,  model .nltm="P0" ,  verbose=TRUE,  minsplit=5) 

3.  Plot  the  regression  tree. 

plot (fit) 

4.  Mark  te  nodes  of  the  tree. 

text (fit,  pretty=TRUE,  all=TRUE) 

The  result  is  shown  on  Figure  15. 


6  Data  Analysis 

This  section  describes  the  work  performed  in  Year  3. 


6.1  Memorial  Sloan  Kettering  Cancer  Center  Database 

We  have  conducted  data  analysis  using  four  different  endpoints. 

Biochemical  This  endpoint  is  defined  by  ASTRO  as  three  successive  prostate  specific  antigen  (PSA)  elevations 
observed  from  a  post-treatment  nadir  PSAlevcl.  This  endpoint  is  also  termed  ”‘PSA  relapse’” 
or  ”‘PSA  failure’”  (IJROBP  37:1035-1041  1997). 

Local  failure  Local  failure  is  defined  as  palpable  recurrence  and/or  positive  re-biopsy  (Biopsy-confirmed  re¬ 
currence). 


Distant  Distant  failure  represents  detection  of  distant  metastasis  (DM). 

Survival  This  failure  id  defined  as  prostate  cancer-specific  death  (cause-specific  failure). 

Other  than  local  failure,  all  end  points  examined  may  stem  from  two  sources:  subclinical 
metastases  already  present  at  the  time  of  treatment  or  shedding  of  tumor  cells  that  were  not 
sterilized  during  radiation  therapy. 

The  significance  of  models  accomodating  long-  and  short-term  effects  developed  in  this  project 
is  that  they  allow  us  to  reproduce  complex  the  timing  patterns  of  failures  resulting  from  failures 
originating  from  different  biological  sources. 

The  long-term  effect  represents  a  combination  of  long-term  patient’s  prognosis  based  on  clinical 
characteristics  of  the  disease  at  diagnosis,  and  the  treatment  effect  representing  the  chance  that 
we  eradicate  tumor  cells  at  the  time  of  treatment.  It  is  therefore  expected  that  with  the  prognostic 
index  being  fixed,  a  more  radical  treatment  would  result  in  a  higher  cure  rate. 

At  the  same  time  treatment  may  affect  metastatic  cells  is  a  different  way  or  exert  a  ’’‘survival 
of  the  fittest”’  effect  on  the  residual  tumor  cells  giving  rise  to  a  failure.  These  effects  have  the 
potential  of  altering  the  dynamics  and  timing  of  the  development  of  failure  without  necessarily 
affecting  the  long-term  survival  determined  by  the  probability  that  no  residual  tumor  cells  are 
present  rather  than  by  their  biological  characteristics. 
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Figure  15:  An  example  of  the  regression  tree  built  using  a  PO  nonlinear  transformation  model. 
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Having  the  biostatistical  instrument  at  our  disposal  that  recognizes  the  above  mentioned  com¬ 
plexity,  we  approach  the  analysis  using  a  nonlinear  transformation  cure  models  with  two  predictors. 

In  our  preliminary  data  analysis  reported  in  Year  2  progress  report  (Section  8)  we  have  iden¬ 
tified  two  models  potentially  suitable  for  complex  survival  responses  observed  in  prostate  cancer: 
the  Gamma  Frailty  model  with  covariates  incorporated  into  its  scale  and  shape  parameter,  and  the 
so-called  PHPH  cure  model.  We  also  observed  that  PHPH  model  was  the  only  one  that  allowed 
us  to  reproduce  crossing  survival  curves.  Having  obtained  an  extended  version  of  the  MSKCC 
database  of  patients  undergoing  radiation  therapy  for  prostate  cancer,  we  have  observed  crossing 
survival  curves  for  some  of  the  covariate  groups  as  shown,  for  example,  in  Figure  1.  Based  on  this 
observation,  we  decided  to  select  the  PHPH  model  as  our  primary  tool  of  data  analysis.  Our  prior 
experience  suggests,  however,  that  other  cure  models  with  two  predictors  such  as  PHPOC,  or  a 
two  component  mixture  model,  would  give  very  similar  results. 

The  population  examined  in  this  study  consists  of  1765  patients  with  biopsy-confirmed  local¬ 
ized  prostate  cancer  who  were  treated  at  MSKCC  with  external-beam  radiation  therapy  (EBRT). 
Information  on  local  failure  was  available  for  only  1275  patients.  Clinical  and  treatment  charac¬ 
teristics  of  patients  were  summarized  in  two  categorical  variables:  dose  (0,  1,  2,  3)  and  prognosis 
(0,  1,  2).  The  four  dose  levels  are  corresponding  to  the  intervals  [41.4,70.2],  [71.6,75.6],  [75.6,79.2] 
and  [84.6,86.4]  Gy,  and  three  prognosis  categories  (low,  intermediate  and  high  risk)  were  defined 
by  pre-treatment  PSA,  Gleason  score  and  tumor  stage. 

6.1.1  Biochemical  failure 

Relative  risk  estimates  and  estimated  probabilities  of  long-term  survival  resulting  from  computer¬ 
intensive  backwards  pooling  model  selection  procedure  (Section  4.2.2)  are  shown  in  Table  1. 


Relative 

Risk 

Prognosis 

group 

Dose 

group 

0 

1 

2 

3 

Long- 

0 

1.0 

0.27 

Term 

1 

1.65 

1.0 

0.54 

0.27 

Effect 

2 

2.24 

1.65 

1.0 

Short- 

0 

1.0 

3.45 

1.99 

1.0 

Term 

1 

1.99 

6.70 

Effect 

2 

3.45 

6.70 

Probability 

0 

0.59 

0.87 

of  long-term 

1 

0.42 

0.59 

0.75 

0.87 

survival 

2 

0.31 

0.42 

0.59 

Table  1:  Relative  risk  estimates  for  long-  and  short-term  effects  for  the  final  model  for  biochemical 
failure  endpoint.  MSKCC  database  analysis. 

Shown  in  the  following  Figure  16  is  a  detailed  EMc  package  output  that  was  used  to  build  the 
table. 

Based  on  these  result,  we  can  make  the  following  key  conclusions  (all  effects  are  highly  signif¬ 
icant,  see  Figure  16  for  details). 
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Figure  16:  Output  of  the  EMc  package  representing  the  final  model  for  biochemical  failure. 
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1.  Overall,  increasing  dose  improves  the  chance  of  long-term  survival  of  localized  disease  patients 
with  any  prognosis. 

2.  In  the  favorable  prognostic  group  dose  levels  2  and  3  represent  a  possible  over-treatment.  This 
is  evident  from  the  fact  that  long-term  effects  for  dose  levels  1,2,  and  3  were  pooled  in  search 
for  the  best  model. 

3.  Prognosis  groups  1  and  2  show  more  rapid  development  of  failure  with  dose  increasing  from 
level  1  to  2.  This  leads  to  the  speculation  that  75.6  Gy  is  a  possible  cutpoint  when  dose  starts 
affecting  the  dynamics  of  tumor  re-growth  after  radiation  therapy. 

Shown  in  Figure  17  is  a  regression  tree  built  using  the  methodology  developed  in  this  project 
(Section  4.2.3).  While  the  tree  does  not  allow  us  to  directly  test  statistical  hypotheses,  it  provides 
a  data  mining  tool  that  can  suggest  a  direction  for  further  clinical  trials.  For  example  as  evident 
from  Figure  17,  lowest  AIC  corresponds  to  nodes  5  and  6.  These  nodes  represent  a  partition 
of  dose  for  low  PSA  high  grade  patients  (node  5),  and  higher  PSA  early  stage  patients  (node 
6),  which  indicates  possible  subgroups  of  patients  where  increasing  radiation  dose  may  have  the 
highest  effect. 

6.1.2  Local  failure 

Relative  risk  estimates  and  estimated  probabilities  of  long-term  survival  resulting  from  computer¬ 
intensive  backwards  pooling  model  selection  procedure  (Section  4.2.2)  applied  to  the  local  failure 
endpoint  are  shown  in  Table  2. 


Relative 

Prognosis 

Dose  group 

Risk 

group 

0 

1 

2 

3 

Long- 

0 

1.0 

Term 

1 

2.89 

1.0 

Effect 

2 

2.89 

1.0 

Short- 

0 

1.0 

Term 

1 

2.85 

1.0 

Effect 

2 

2.85 

Probability 

0 

0.93 

of  long-term 

1 

0.81 

survival 

2 

0.81 

0.93 

Table  2:  Relative  risk  estimates  for  long-  and  short-term  effects  for  the  final  model  for  the  local 
failure  endpoint.  MSKCC  database  analysis. 

From  the  table  it  is  clear  that  the  local  failure  endpoint  is  much  less  sensitive  to  the  treatment 
dose  than  a  biochemical  failure.  This  could  be  explained  in  part  by  the  smaller  number  of  events 
observed  for  this  endpoint  and  the  associated  power  deficiency.  Based  on  the  table  we  may  suggest 
the  following  conclusions. 

1.  Overall,  increasing  dose  is  associated  with  a  highly  significant  improvement  in  the  long-term 
survival. 
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Figure  17:  Regression  tree  built  for  biochemical  failure  endpoint 
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2.  Dose  levels  1,2,  and  3  may  represent  an  over-treatment  of  patients  with  the  most  favorable 
prognosis.  This  is  evident  from  the  fact  that  all  dose  levels  were  pooled  together  in  long-  and 
short-term  predictors  for  the  favorable  prognosis  group  (prognosis=0). 

3.  Patients  with  intermediate  or  unfavorable  prognosis  benefit  primarily  from  highest  dose  levels. 

4.  Although  the  short-term  effect  is  significant,  we  find  it  difficult  to  interpret  due  to  a  lack  of 
clear  regularity. 

Local  failure  endpoint  is  not  informative  enough  to  build  a  meaningful  regression  tree. 

6.1.3  Distant  metastasis 

Relative  risk  estimates  and  estimated  probabilities  of  long-term  survival  resulting  from  computer¬ 
intensive  backwards  pooling  model  selection  procedure  (Section  4.2.2)  applied  to  the  distant  metas¬ 
tasis  endpoint  are  shown  in  Table  3. 


Relative 

Risk 

Prognosis 

group 

Dose  group 

0 

1 

2  3 

Long- 

Term 

Effect 

0 

1.0 

1 

7.32 

3.69 

2 

25.3 

18.7 

Short- 

Term 

Effect 

0 

1.0 

1 

2 

Probability 
of  long-term 
survival 

0 

0.82 

1 

0.82 

0.92 

2 

0.56 

0.65 

Table  3:  Relative  risk  estimates  for  long-  and  short-term  effects  for  the  final  model  for  the  distant 
metastasis  endpoint.  MSKCC  database  analysis. 


Based  on  the  table  we  may  suggest  the  following  conclusions. 

1.  Overall,  increasing  dose  is  associated  with  a  reduced  chance  to  develop  distant  metastasis  (highly 
significant). 

2.  No  dose  effect  on  metastasis  is  observed  for  patients  with  favorable  prognosis.  This  may  be  due 
to  a  lack  of  power  as  such  patients  generally  have  a  low  chance  to  develop  metastases. 

3.  There  is  no  short-term  effect,  and  the  optimal  model  for  metastasis  endpoint  is  essentially  a 
proportional  hazards  model. 

Shown  in  Figure  18  is  a  regression  tree  built  using  the  methodology  developed  in  this  project 
(Section  4.2.3).  As  is  evident  from  the  figure,  there  is  a  well  structured  prognostic  groups  subdivi¬ 
sion  for  early  stage  patients.  Also  the  tree  suggests  that  increasing  dose  might  benefit  early  stage 
high  grade  patients. 
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Figure  18:  Regression  tree  built  for  the  distant  metastasis  endpoint 
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6.1.4  Cause-specific  survival 

Relative  risk  estimates  and  estimated  probabilities  of  long-term  survival  resulting  from  computer¬ 
intensive  backwards  pooling  model  selection  procedure  (Section  4.2.2)  applied  to  the  cause-specific 
survival  endpoint  are  shown  in  Table  4. 


Relative 

Risk 

Prognosis 

group 

Dose  group 

0 

12  3 

Long- 

Term 

Effect 

0 

1.0 

1 

1.0 

2 

1.9 

4.02 

Short- 

Term 

Effect 

0 

1.0 

1 

9.64 

2 

28.79 

9.64 

Probability 
of  long-term 
survival 

0 

0.93 

1 

0.82 

2 

0.69 

0.46 

Table  4:  Relative  risk  estimates  for  long-  and  short-term  effects  for  the  final  model  for  the  cause- 
specific  survival  endpoint.  MSKCC  database  analysis. 

Based  on  the  table  we  may  suggest  the  following  conclusions. 

1.  Except  for  patients  with  unfavorable  prognosis,  no  dose  effect  is  observed  on  the  prostate-specific 
long-term  survival.  This  may  be  due  to  lack  of  power. 

2.  A  controversial  adverse  long-term  effect  of  dose  is  observed  in  the  unfavorable  prognosis  group. 
This  Ending  stands  in  need  of  explanation. 

3.  Patients  unfavorable  prognosis  show  short-term  benefit  associated  with  increased  treatment 
dose. 

Shown  in  Figure  19  is  a  regression  tree  built  using  the  methodology  developed  in  this  project 
(Section  4.2.3).  The  only  node  involving  dose  has  a  suboptimal  AIC,  which  suggests  that  there 
might  not  be  a  dose  effect  on  cause-specific  survival. 

6.2  SEER  public  database 

SEER  public  database  offers  an  opportunity  to  study  the  effects  of  surgery  and  radiation  in  subsets 
of  patients  defined  by  stage  and  grade.  In  order  to  avoid  confounding  due  to  dissemination  of  the 
PSA  test  in  the  population  in  the  late  80ies  and  lead-time,  length  bias  and  overdiagnostic  bias 
that  dramatically  affect  survival  curves  during  this  transient  period,  we  focused  on  cases  diagnosed 
after  1990.  A  subset  of  23,606  such  cases  diagnosed  in  the  San-Francisco  area  was  selected  for  the 
analysis.  Since  a  combination  of  surgery  and  radiotherapy  is  very  uncommon  in  prostate  cancer, 
we  do  not  show  estimates  pertaining  to  this  group  of  patients. 

Relative  risk  estimates  resulting  from  computer-intensive  backwards  pooling  model  selection 
procedure  (Section  4.2.2)  applied  to  the  cause-specific  survival  endpoint  are  shown  in  Table  5. 
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Prostate  Cancer  Specific  Survival 


Figure  19:  Regression  tree  built  for  the  prostate  cancer-specific  survival  endpoint 
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Stage 

Localized/Regional 

Distant 

Treatment 

No  RT 
No  Surg 

No  RT 
Surgery 

RT 

No  Surg 

No  RT 
No  Surg 

No  RT 
Surgery 

RT 

No  Surg 

Long-term 

Effect 

Low  Grade 

1.0 

0.19 

2.42 

1.0 

0.41 

1.0 

High  Grade 

3.96 

1.0 

2.42 

1.83 

0.41 

1.83 

Short-term 

Effect 

Low  Grade 

1.0 

1.0 

0.25 

1.0 

1.0 

1.93 

High  Grade 

1.52 

0.48 

1.0 

1.0 

1.93 

Table  5:  Relative  risk  estimates  for  long-  and  short-term  effects  for  the  final  model  for  the  cause- 

specific  survival  endpoint.  SEER  database  analysis. 

Based  on  the  table  we  may  suggest  the  following  conclusions. 

1.  Surgery  shows  long-term  advantage  over  no  treatment,  presumably  watchful  waiting,  accross 
stages  and  grades. 

2.  A  controversial  adverse  long-term  effect  of  radiation  is  observed  in  low  grade  patients.  A  possible 
explanation  might  be  that  the  effect  is  confounded  by  PSA.  PSA  measurements  are  only  available 
in  SEER  for  a  couple  of  recent  years.  The  decision  to  treat  with  radiotherapy  may  show  a  positive 
correlation  with  PSA  levels  at  diagnosis,  and  therefore  low  grade  localized  stage  patients  treated 
by  radiotherapy  may  have  higher  PSA  levels  than  similar  watchful  waiting  patients. 

3.  Long-term  survival  rates  for  high  grade  localized  patients  treated  by  surgery  (predominantly 
radical  prostatectomy)  are  superior  to  those  of  radiotherapy. 

4.  Both  surgery  and  radiotherapy  also  show  a  short-term  advantage  in  high-grade  localized  tumors. 

5.  In  low-grade  localized  patients,  only  surgery  shows  a  short-term  advantage. 

6.  Surgery  is  the  only  treatment  showing  an  effect  in  SEER  distant  stage. 

7.  Short-term  effect  of  either  surgery  or  radiation  in  distant  stage  is  an  adverse  one. 

7  Key  Research  Accomplishments 

Simmarizing,  the  key  research  accomplishments  of  the  project  are: 

1.  Development  of  the  class  of  Nonlinear  Transformation  Models  (NTM)  and  associated  QEM 
estimation  procedures  and  their  computer  implementation; 

2.  Development  of  composition  technique  as  a  tool  for  model  building. 

3.  Development  of  a  numerically  efficient  algorithm  for  estimation  of  the  inverse  of  profile  infor¬ 
mation  matrix  for  the  models  that  paved  the  way  to  efficient  model  selection  and  hypothesis 
testing  procedures. 

4.  Development  of  shareware  software  that  makes  these  powerful  tools  available  for  broad  scientific 
community. 
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5.  Multivariate  computer-intensive  regression  analysis  of  two  large  prostate  cancer  databases  that 
allowed  us  to  draw  many  non-trivial  conclusion  about  the  efficacy  of  prostate  cancer  treatment 
in  various  subsets  of  patients  defined  by  clinical  and  prognostic  characteristics  of  their  tumors 
(Section  6) 
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9  Conclusions 

We  have  completed  methodology  and  software  development  for  point  and  interval  estimation  and 
variable  selection  for  compound  Nonlinear  Transformation  Models.  We  have  built  a  number  of 
candidate  compound  models  for  prostate  cancer  and  verified  their  properties  analytically  and  by 
simulations.  We  used  the  new  software  and  methodology  to  apply  these  models  to  a  number  of 
real  and  simulated  test  data  sets.  This  methodology  and  software  arsenal  was  used  to  identify 
subsets  of  patients  showing  varying  treatment  effects.  Broadly  speaking,  we  found  that  increasing 
radiotherapy  dose  is  benefitial  for  biochemical  recurrence,  local  failure  and  distant  metastasis,  and 
has  the  potential  to  improve  long-term  survival  except  perhaps  in  a  subgroup  of  patients  with  most 
favorable  prognosis.  However,  no  proven  benefit  was  discovered  as  far  as  cause-specific  survival 
is  concerned,  which  rases  the  question  of  whether  improving  local  control  in  prostate  cancer  is 
an  optimal  strategy  to  reduce  mortality  from  the  disease.  Analysis  of  population  registry  data 
indicates  that  radical  prostatectomy  may  have  an  advantage  over  radiotherapy  in  some  subsets  of 
patients  with  localized  disease.  However,  this  result  may  be  confounded  by  missing  PSA  data  in 
SEER  registry. 
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Abstract 

Introducing  a  random  effect  into  the  Cox  model  is  a  useful  tool  for 
building  hierarchical  families  of  univariate  semiparametric  regression 
survival  models.  Hougaard  [1984]  used  the  Laplace  transform  to  build 
frailty  models  with  explicitly  defined  survival  functions  and  random  ef¬ 
fects.  The  family  derived  from  stable  distributions  was  then  extended 
[Aalen,  1992]  to  frailty  variables  following  a  Discrete-Continuous  com¬ 
pound  (Poisson-Gamma)  structure.  Still,  in  this  form  the  techniques 
applies  only  to  a  subset  of  frailty  models.  In  this  paper  we  extend  the 
idea  of  compounding  first  to  arbitrary  frailty  models  and  then  to  non¬ 
frailty  Nonlinear  Transformation  Models  (NTM).  EM  algorithm  can 
be  used  to  provide  inference  with  frailty  models.  Motivated  by  sec¬ 
ond  moment  properties  of  frailty  models,  Tsodikov  [2003]  generalized 
the  EM  algorithm  into  a  non-frailty  frame  represented  by  the  Quasi- 
EM  algorithm  (QEM).  We  derive  a  chain  rule  showing  that  QEM 
will  fit  any  model  constructed  using  the  new  composition  technique, 
provided  it  is  applicable  to  the  submodels.  Simulations,  real  data 
and  a  variety  of  models  are  used  to  illustrate  the  composition  tech¬ 
nique.  Non-identifiability  aspect  of  semiparametric  frailty  models  is 
discussed.  Many  important  modelling  issues  and  links  are  highlighted. 
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1  Introduction 


The  model  diversity  in  semiparametric  regression  analysis  has  been  largely 
developed  on  a  case  by  case  basis.  Models  include  the  Proportional  Hazards 
(PH)  model  [Cox,  1972],  the  Proportional  Odds  model  (PO)  [Bennett,  1983], 
generalized  Odds-rate  model  [Dabrowska  and  Doksum,  1988],  linear  trans¬ 
formation  models  [Cheng  et  al.,  1995],  models  motivated  by  frailties  such  as 
cure  models  [Tsodikov  et  al.,  2003].  Despite  efforts  to  built  a  universal  frame¬ 
work  of  statistical  inference  with  semiparametric  models,  including  general 
instruments  for  model  building,  numerical  estimation  algorithms,  identifia- 
bility  and  asymptotics,  general  and  practical  results  are  still  a  challenge,  and 
most  existing  inferential  tools  are  model-specific.  In  this  paper  we  develop  a 
fragment  of  a  general  approach  for  a  class  of  semiparametric  models  equipped 
with  model-building  and  inferential  algorithms. 

Motivated  by  second-order  properties  of  frailty  models  Tsodikov  [2003] 
proposed  a  family  of  so-called  Nonlinear  Transformation  Models  (NTM)  and 
supplied  it  with  a  general  numerical  inference  framework  based  on  the  QEM 
algorithm,  a  subset  of  recently  developed  MM  algorithms  [Lange  et  al.,  2000]. 
In  this  paper  we  will  equip  the  NTM-QEM  frame  with  a  model-building  tool 
that  allows  us  to  generate  a  span  of  hierarchical  models  from  a  set  of  basis 
models  such  as  PH  and  PO,  such  that  they  remain  within  the  frame  and 
QEM  is  guaranteed  to  work  on  any  descendant  model.  When  none  of  the 
basis  models  is  suitable  for  the  data,  the  technique  offers  an  automatic  way 
of  combining  features  of  simpler  models  in  search  of  a  more  complex  model 
that  would  be  right  for  the  data  and  such  that  inference  procedures  are  still 
available. 

Let  7 (a;  1 6)  be  a  parametrically  specified  strictly  increasing  distribution 
function  on  [0,1],  where  0  is  a  vector  of  parameters.  To  define  an  NTM, 
we  first  make  7  a  regression  model  by  turning  6  =  (6 1, . . .  ,9  k)  into  a  set 
of  predictors  depending  on  covariates,  2,  and  regression  coefficients  (3  = 
(0i,  •  •  ■  ,0fc)-  Typically,  0i(0i,z)  =  exp {(3jz}.  Thus,  7  explicitly  represents 
the  parametric  part  of  the  model  and  is  called  an  NTM  generating  function. 
Denote  by  A f  the  class  of  all  NTM  generating  functions. 

Let  F(t)  be  a  nonparametrically  specified  (a  step-function)  baseline  sur¬ 
vival  function.  In  a  Nonlinear  Transformation  Model  it  is  assumed  that 
survival  function  G(t  \  (3,  z )  can  be  represented  as 

G(t  |  0,  z)  =  7  {F(t)  |  0,  z}  =  (7  o  F)(t  |  0,  z).  (1) 
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Consider  a  sample  of  right  censored  data  under  non-informative  censoring. 
The  plug-in  form  (1)  induces  an  Von-Mises  style  likelihood  whose  Frechet 
derivative  with  respect  to  F  leads  to  a  self-consistency  score  equation 

h‘ =  (2) 

where  ht  =  H(t)  —  H(t  —  0)  is  the  jump  of  the  baseline  cumulative  hazard 
H  =  —  log  F  at  time  t,  lZf  is  a  set  of  subjects  at  risk  at  time  t,  ti  are  their 
event  times,  c  is  a  censoring  indicator  (c  =  1  for  a  failure,  c  =  0  for  a  censored 
observation),  and  0  is  a  parametric  function  defined  through  7  as 


0  [x  |  •,  c 


y( c+1)(a;  I  •) 

C  +  X - Ty- - : — — 


(3) 


where  7^(x  |-)  is  the  derivative  of  7  of  the  ith  order  with  respect  to  x, 
7(0)  =  -y.  For  a  frailty  model,  0  is  the  conditional  expectation  of  the  frailty 
variable  given  observed  data  and  represents  the  result  of  the  E-step  of  an 
EM  algorithm  [Tsodikov,  2003]. 

Solving  (2)  by  iterations 


H^k+l)  =  v?(ih(fc)),  (4) 

where  tp  denotes  the  right  part  of  the  self-consistency  equation  (2)  as  a  func¬ 
tional  of  H  —  —  log  F,  and  k  counts  iterations,  is  an  QEM  algorithm.  Its 
point  of  convergence  represents  a  fixed  point  of  p>  and  an  NPMLE  of  F  (or 
H)  given  (3.  Note  that  in  this  form  the  algorithm  is  not  based  on  missing 
data  and  is  applicable  to  non-frailty  models.  For  a  frailty  model  though,  this 
is  an  EM  algorithm.  It  can  be  shown  that  if  0(x|-)  is  a  non-decreasing  func¬ 
tion  of  x,  which  is  the  case  for  all  frailty  models,  then  each  QEM  iteration 
improves  the  likelihood,  which  is  a  key  property  for  convergence.  The  solu¬ 
tion  of  the  self-consistency  equation  can  be  written  as  an  implicit  function  of 
(3,  F  =  F(j3).  Plugging  this  solution  into  the  full  loglikelihood  £((3,  F)  gives 
us  the  profile  likelihood 


£pr((3)  =  £((3,F((3)),  (5) 

that  is  used  to  provide  inference  for  f3,  [Murphy  and  Van  der  Vaart,  2000, 
Tsodikov,  2003]. 
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Aside  from  a  broader  class  of  models,  the  advantage  of  the  NTM-QEM 
approach  over  the  traditional  frailty-EM  one  is  that  analytic  work  required 
to  specify  the  algorithm  and  verify  its  applicability  is  minimized.  Given  a 
new  model  G  =  7 (F),  with  the  traditional  frailty  framework,  one  would  first 
have  to  verify  that  7  is  a  completely  monotonic  function  (Bernstein  theorem, 
[Feller,  1971])  to  ensure  that  the  new  model  is  a  frailty  model.  Then  one 
needs  to  invert  the  Laplace  transform  to  find  the  distribution  of  the  frailty 
random  variable  U .  With  the  conditional  distribution  of  U,  given  observed 
data,  the  conditional  expectation  of  U  is  derived  to  provide  missing  data 
imputation.  Closed  form  expressions  are  required  throughout  to  ensure  a 
numerically  efficient  algorithm.  NTM-QEM  approach  makes  the  above  exer¬ 
cise  obsolete  and  boils  down  to  taking  two  derivatives  of  7  and  verifying  that 
0  (3)  is  nondecreasing.  QEM  is  faster  than  the  traditional  EM  that  uses 
partial  likelihood  at  the  M-step  even  with  models  that  have  a  single  predic¬ 
tor  [Tsodikov,  2003]  and  found  applications  in  computer-intensive  inference 
procedures  such  as  the  bootstrap  [Dixon  et  ah,  2005].  With  models  having 
multiple  predictors  or  parameters,  such  as  the  T-frailty  model,  use  of  partial 
likelihood  implies  that  parameters  outside  the  partial  likelihood  still  need 
to  be  estimated  after  EM  converges,  which  makes  the  traditional  frailty-EM 
approach  very  inefficient  numerically. 

The  idea  of  this  paper  is  to  build  semiparametric  models  by  operation 
of  composition  of  7s.  Indeed,  since  an  NTM  generating  function  7  has  the 
domain  [0, 1]  and  range  in  the  same  interval,  a  composition  of  any  number 
of  such  functions  is  again  an  NTM  generating  function.  Let  7*(x  |  6t),  i  = 
1,  2, . . .,  be  NTM  generating  functions  for  a  set  of  basis  models.  In  this  paper 
we  study  models  built  as 


7 i(x  1 0*)  °  7j0  1 0j)  =  7 %  (7 j(x  1 0j)  | Qi)  =  7 ij(x  1 0i,6j),  (6) 

from  any  two  sumbodels  7 \  and  7 and  show  that  QEM  is  applicable  to 
any  compound  model.  The  composition  techniques  will  be  motivated  by 
frailty  models.  We  will  show  that  it  generalizes  Aalen’s  compound  Poisson 
device  based  on  Discrete-Continuous  compounding  [Aalen,  1992,  Moger  et  ah, 
2004]  and  give  examples  of  compound  frailty  models.  We  extend  the  device 
to  arbitrary  (discrete  or  continuous)  frailty  submodels.  We  then  consider 
composition  for  NTMs  and  discuss  identihability  issues.  Finally,  we  apply 
the  composition  technique  to  real  data  and  study  asymptotic  properties  of 
the  profile  likelihood  MLEs  by  simulations. 
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2  Discrete-Continuous  composition  device 

This  section  is  a  brief  review  of  existing  model-building  methodology.  We 
will  use  it  in  the  semiparametric  setting  to  construct  the  PHPH  model  in 
section  6.2.1. 

Heterogeneity  has  been  a  popular  tool  of  extending  survival  models.  Sev¬ 
eral  authors  considered  variations  of  the  so-called  Proportional  Hazards  (PH) 
frailty  model  [Hougaard,  1984,  Aalen,  1992,  Klein,  1992,  Nielsen  et  ah,  1992], 

G(()  =  E{F(()1'},  (7) 

where  G  is  a  population  survival  function,  F  is  the  baseline  survival  function, 
and  U  is  a  nonnegative  random  variable  (frailty).  Hougaard  [1984]  observed 
that  (7)  can  be  written  as  a  Laplace  transform  of  U 

G(t)  =  Cjj  {H{t)}  ,  Luis)  =  E  {e~sU}  ,  (8) 

where  H  =  —  log(F)  is  the  baseline  cumulative  hazard. 

The  Laplace  transform  connection  (8)  was  used  by  Aalen  [1992],  Moger 
et  al.  [2004]  to  build  a  family  of  models  induced  by  a  compound  Poisson- 
Gamma  distribution  for  U.  The  family  was  used  in  the  parametric  setting 
with  H(t)  specified  according  to  Weibull  distribution. 

Tsodikov  et  ah  [2003]  used  the  idea  as  an  instrument  to  build  semipara¬ 
metric  survival  models  induced  by  compound  Discrete-Continuous  frailties. 
The  frailty  random  variable  U  was  regressed  on  covariates  z,  so  that  U{/3 ,  z) 
is  considered  as  a  response  variable  in  a  parametric  regression  model,  where 
(3  is  a  vector  of  regression  coefficients.  The  parameters  of  the  distribution  of 
U  were  thought  of  as  regression  predictors  depending  on  covariates  and  re¬ 
gression  coefficients.  Let  zv(/30,  z)  be  a  discrete  nonnegative  random  variable 
with  the  distribution  with  parameter  vector  6  =  0(/3g,z)  and  the  Laplace 
transform  Cq.  Let  be  i.i.d.  copies  of  a  random  variable  £(/3  ,  z)  parameter¬ 
ized  through  r/i/3  ,z)  with  the  Laplace  transform  Crj-  Then  the  compound 
distribution 

o 

U((3,  z)  —  tkiP^z),  5Z  =  0,  (9) 

k=  1  1 

has  the  Laplace  transform 

ce,r]  =  ce{~l°  gA?}.  (io) 
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In  view  of  (8)  and  (10),  the  compound  survival  function  generated  by  (9)  has 
the  form 

G(t\f3,z)=Ce{-logCri(H(t))}:  (11) 

where  {3  =  ((3e,(3  ).  In  the  semiparametric  context,  H  is  treated  as  an 
infinite  dimensional  parameter,  a  step-function.  As  a  result  of  the  above 
procedure  a  new  hierarchical  model  G  =  Cq  ^(H)  is  constructed  so  that  it 
combines  features  of  two  submodels  G  =  Cq{H)  and  G  =  Crj{H). 

3  PH  mixture  model  vs.  NTM 

Following  [Wassel  and  Moeschberger,  1993,  Clayton  and  Cuzick,  1985a]  who 
considered  frailty  variables  dependent  on  covariates,  we  may  write  a  general 
univariate  PH  mixture  model  as 

G(t\(3,z)  =  E^F(t)u(f3'z)  z}.  (12) 

This  model  can  be  considered  a  generalization  of  the  so-called  PH  frailty 
model,  or  a  PH  model  with  a  random  effect 

G(t  \(3,z)  =  E{F(t)0&'z)v},  (13) 

where  9  is  a  predictor,  and  V  is  a  random  variable  independent  of  the  covari¬ 
ates,  considered  by  [Hougaard,  1984,  Klein,  1992,  Nielsen  et  ah,  1992]  and 
many  other  authors,  for  different  distributions  of  V. 

We  can  make  the  following  important  observations  about  the  class  of  PH 
mixture  models  (12).  The  survival  function  (12)  is  built  by  composition 

G(t\(3,z)  =  (jo  F)(t\fi,z),  (14) 

where  r){x\  f3,z)  belongs  to  the  class  V  of  probability  generating  functions 
(p.g.f.).  Here  we  extend  the  use  of  p.g.f.  to  arbitrary  nonnegative  random 
variables  and  define  any  such  p.g.f.  p  as  p(x)  =  C  {—  log(x)},  where  C  is  the 
Laplace  transform.  Covariates  enter  7  e  V  through  the  parameters  9  of  the 
distribution  of  U  as  they  are  turned  into  regression  predictors,  typically  as 
6  =  (exp{/3 Jz}, . . . ,  exp{/3^z})T,  k  =  dim(0).  Note  that  p.g.f.  thus  defined 
is  an  NTM  generating  function  (see  Introduction),  so  that  V  C  Af,  and  PH 
mixture  models  (12)  are  a  subclass  of  NTMs  (1).  Since  p.g.f.  is  an  analytic 
function,  the  complement  of  the  class  of  frailty  models  to  the  NTM  class 
includes,  for  example,  models  with  a  non-existent  derivative  7^  for  some 
k>  2. 
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4  Imputation  operator  and  the  meaning  of 
nondecreasing  0 

Suppose,  we  have  an  observation  (t,  z,  c )  sampled  from  the  PH  mixture  model 
under  independent  censoring,  where  t  is  an  observed  survival  time  and  c  is 
a  censoring  indicator  (c  =  0  if  t  is  a  censored  survival  time,  and  c  =  1  if 
t  is  a  failure).  Then,  under  the  PH  mixture  model  (12),  the  conditional 
expectation  of  U,  given  the  observed  event  ( t,z,c )  is  given  by  [Tsodikov, 
2003] 

E  {[/(•)  1 1,  -,  c}  =  (0  o  F)(t  |  -,  c)  =  ©  [F(t)  |  •,  c] , 

where  the  function  0  is  given  by  (3).  For  brevity,  we  use  (•)  to  suppress 
covariates  and  regression  coeffitients  /3,  z.  While  0  in  the  Introduction  is 
defined  for  NTMs,  we  also  consider  the  p.g.f.  subclass  7  G  VdJ\f  as  a 
motivation  and  to  better  understand  the  conditions  that  make  the  NTM- 
QEM  tandem  work. 

Cauchy-Schwartz  inequality  can  be  used  to  show  that  for  any  mixture 
model  7  G  V,  0  [x  |  •,  c]  is  nondecreasing  in  x  for  any  c  =  0, 1.  The  nonde¬ 
creasing  character  of  the  function  0  in  the  above  statement  is  quite  natural. 
The  longer  the  subject  stays  event-free,  the  lower  the  subject’s  posterior  risk, 
represented  by  0.  So  Q{F(t)  |  •,  c}  must  be  a  nonincreasing  function  of  t  for 
both  failure  (c  =  1)  and  censoring  (c  =  0)  events.  Since  the  survival  function 
F(t)  is  nonincreasing  in  t,  0(x  |  •,  c)  must  be  nondecreasing  in  x.  It  is  interest¬ 
ing  to  note  that  the  population  hazard  function  for  a  heterogeneous  popula¬ 
tion  under  the  PH  mixture  model  is  expressed  as  A (t  \  z)  =  Q{F(t)  |  *,  0 }h(t), 
where  h  is  the  hazard  function  corresponding  to  F.  Even  if  h(t)  is  increasing, 
the  observed  population  hazard  function  may  be  a  decreasing  one  through 
the  decreasing  behavior  of  ©{ W(t)  | - ,  0}  with  time.  This  observation  repre¬ 
sents  a  selection  effect  of  the  risk  set  becoming  “healthier”  with  time,  as  frail 
individuals  leave  the  population.  This  effect  was  discovered  and  extensively 
studied  in  demography  [Vaupel  et  ah,  1979]  in  the  context  of  misinterpreta¬ 
tion  of  mortality  trends. 

With  7  representing  a  PH  mixture  model  7  G  V,  kth  moments  of  the 
mixing  variable  U,  k  =  1,2,...,  can  be  obtained  through  derivatives  7^. 
Both  0  and  QEM  are  defined  using  the  derivatives  up  to  second  order  of  7, 
k  —  1,2.  Based  on  the  above  observations,  NTM-QEM  tandem  is  defined 
to  follow  second-order  properties  of  the  Frailty-EM  frame.  This  is  all  that 
is  needed  to  ensure  the  EM-like  behavior  of  the  QEM,  and  existence  of  all 
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derivatives  of  7  (still  a  weaker  assumption  than  that  of  a  frailty  model)  is 
excessive  for  purposes  of  statistical  inference. 

As  discussed  in  [Tsodikov,  2003],  the  property  of  non-decreasing  0  repre¬ 
sents  a  generalized  form  of  Jensen  inequality  on  the  primitive  class  of  func¬ 
tions  necessary  to  handle  the  QEM  algorithm. 

In  addition  to  being  a  non- increasing  function  of  time,  the  posterior  risk 
E  {U (•)  1 1,  •,  c}  for  PH  mixture  models  (7  G  V)  has  the  following  two  natural 
properties. 

1.  Other  things  equal,  the  posterior  risk  of  a  failure  is  at  least  as  high  as 
a  posterior  risk  of  a  censored  subject  E  {U (•)  1 t,  •,  1}  >  E  {U (•)  1 t,  •,  0}. 
This  statement  is  valid  in  the  general  NTM  form  (see  proposition  be¬ 
low). 

2.  Since  a  censored  observation  at  time  t  =  0  does  not  contribute  any 
information  on  the  risk,  posterior  risk  for  t  =  0,  c  =  0  is  the  same  as 
prior  risk  E  {U (•)}.  Expressing  the  mean  of  U  through  its  p.g.f.  76P, 
we  have  E  {[/(•)  |  0,  -,0}  =  E{U}  =  7'(1 1-)- 

Proposition  4.1  Surrogate  of  posterior  risk  for  NTM. 

Let  |',  c),  be  the  function  defined  by  (3)  and  induced  by  some  NTM  gen¬ 
erating  function  7,  given  an  event  ( t ,  -,c)  observed  on  a  subject.  Then 

(A)  If  Q(x  |-)  is  a  non- decreasing  function  of  x,  then 

&(F(t)  | •,  1)  >  @(F(t)  | -,  0)  >  0  (15) 

(B)  If  j  G  V  is  a  p.g.f.  of  some  nonnegative  random  variable  U ,  then 

E{U\t,-,l)}>E{U\t,-,0)}>0  (16) 

E{U  |  0,  - ,  0) }  =  E{U  |  •}  =  7'(1 1  •)  (17) 

The  proof  is  given  in  the  Appendix  A.l.  The  graph  of  typical  behavior  of  the 
posterior  risk  is  given  in  Figure  1  based  on  the  real  data  example  considered 
in  Section  8.1. 


Distant  Stage 


0  100  200  300 


Time  to  event  (months) 

Figure  1:  Posterior  risk  Q(F(t)  \  (3,  z,  c) 
by  type  of  event  (failure,  c  =  0  and  cen 
cal/Regional  and  Distant) 
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Relative  Risk 


Local/Regional  Stage,  baseline 


Time  to  event  (months) 


as  a  function  of  time  to  event  t 
:oring  c  =  1),  and  Stage  (z)  (Lo- 


5  Model  building 

5.1  Composition  with  PH  mixture  models 

The  Discrete-Continuous  composition  device  described  in  Section  2  is  quite 
restrictive.  It  covers  only  a  specific  subclass  of  frailty  models.  For  exam¬ 
ple,  should  we  try  to  build  a  hierarchical  model  from  two  submodels,  each 
generated  by  a  continuous  frailty  variable,  the  compound  device  (11)  would 
fail.  With  composition  defined  on  the  level  of  missing  data  (9)  (random  vari¬ 
ables  used  to  construct  U),  it  is  not  clear  how  compositions  other  than  of 
a  Discrete-Continuous  type  could  be  defined.  In  this  section  we  extend  the 
device  of  Section  2  to  arbitrary  frailty  models  by  defining  the  composition 
on  the  level  of  transforms. 

Returning  to  Discrete-Continuous  composition  described  by  (9),  observe 
that  in  terms  of  p.g.f.  (10)  can  be  written  as 

7  9,v  =  7  9°  In,  (18) 

where  70  corresponds  to  a  discrete  random  variable. 

Now  let  70  and  7^  be  two  p.g.f.  of  some  arbitrary  nonnegative  ran¬ 
dom  variables,  parameterized  through  predictors  6((3e,z)  and  re¬ 

spectively.  These  functions  generate  two  PH  mixture  models  of  the  form 
G(t  |  (3.,z)  =  7 .(F(t)  |  (3.,  z).  The  following  proposition  shows  that  the  com¬ 
pound  expression  7 e,n(E  |  •)  is  again  a  PH  mixture  model. 

Proposition  5.1  Composition  for  mixture  models. 

Let  7 e  and  7^  be  some  two  p.g.f.  7e»(^|-)  =  E(xv\-),  7^ (a: | -)  =  E(xt\-), 
where  v  and  f  are  some  independent  nonnegative  random  variables.  Then 
the  compound  function  7y?;  =  70  o  7 ^  is  also  a  p.g.f.,  meaning  that  there 
exists  a  nonnegative  random  variable  U  such  that  7e»,r;(^|')  =  E(xu  |  •). 

The  proof  is  given  in  the  Appendix  A. 2. 

The  above  result  shows  that  the  PH  mixture  subclass  of  NTM  is  closed 
with  respect  to  composition  of  NTM  generating  functions,  and  consequently, 
the  self-consistency  equation  (2)  defines  an  EM  algorithm  serving  the  new 
compound  model.  Therefore,  convergence  of  (4)  for  the  compound  model 
follows  from  the  general  EM  theory  [Dempster  et  ah,  1977,  Wu,  1983]. 
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5.2  NTM  composition  device 

In  this  section,  we  extend  the  composition  techniques  for  the  PH  mixture 
model  (Section  5.1)  to  the  NTM  class. 

Now,  let  70,7,7  G  Af  be  two  NTM  generating  functions  with  predictors  9 , 
and  77,  respectively,  not  necessarily  from  the  p.g.f.  subclass.  Then 

7(®I0  =  (70°7*?)(z|O  (19) 

is  a  new  NTM  semiparametric  model  with  two  predictors  6  and  rj.  If  70  (a:  |  •)  = 
x  for  some  value  of  6  (usually  for  9  =  1),  then  the  model  (19)  includes  models 
70  and  7,7  as  nested  special  cases.  The  following  statement  shows  that  the 
NTM  class  with  non- decreasing  Theta  is  closed  with  respect  to  composition. 

Proposition  5.2  Composition  chain  rule  for  NTM. 

Let  70  G  A f  and  7,,  G  J\f  be  some  two  NTM -generating  functions,  each  sat¬ 
isfying  the  assumption  of  nondecreasing  0,  where  0  is  given  by  (3),  and  let 
7 0i??  =  70  o  7^  be  the  compound  function.  Let  @a  be  the  Q-function  (3)  cor¬ 
responding  to  7a,  for  any  a.  Then 

(A) 

®e,r,(x  |  ■,  c)  =  @,,(2;  |  •,  0)  {(©e  o  7„)  (x  |  •,  c)  -  c}  +  c0„(2;  |  •,  c),  (20) 

where  c  =  0, 1  and  (0  o  7)  [x  |  •,  c)  is  understood  as  @{7(2;  |  -)|-,  c};  and 

(B)  The  compound  function  ©0^(2;  |-)  derived  from  the  compound  NTM- 
generating  function  70^  is  nondecreasing  in  x. 

Proof  is  given  in  the  Appendix  A. 3. 

Equation  (20)  represents  a  chain  rule  for  0  for  compound  models  and 
simplifies  derivation  of  compound  0  through  direct  use  of  0s  corresponding 
to  submodels  participating  in  the  composition. 

Also,  operation  of  composition  (19)  preserves  the  property  of  nondecreas¬ 
ing  0  discussed  in  Section  4.  Therefore,  convergence  of  the  QEM  algorithm 
(4)  for  the  compound  model  follows  from  the  results  presented  in  [Tsodikov, 
2003], 

6  Examples  of  models 

6.1  Basic  submodels 

In  this  section  we  present  some  popular  models  with  one  predictor  that  will 
be  used  as  a  basis  to  generate  compound  models  in  the  next  section. 


11 


6.1.1  PO  model 


Semiparametric  PO  model  G  (a  survival  function)  can  be  defined  as  the  one 
with  log  odds  ratios  of  survival  independent  of  time  and  an  arbitrary  baseline 
survival  function  F.  It  has  long  been  known  that  the  proportional  odds  (PO) 
model  has  frailty  interpretation  [Clayton  and  Cuzick,  1985b],  As  we  will  see 
later,  the  above  definition  identifies  a  family  of  frailty  models  with  infinitely 
many  possible  frailty  distributions. 

First,  consider  interpretation  of  the  PO  model  as  a  geometric  frailty 
model.  Let  F  be  the  baseline  survival  function  corresponding  to  the  co¬ 
variate  vector  z0  such  that  0(/3,  z0)  =  1.  For  the  PO  model,  the  predictor 
9(/3,  z)  has  the  meaning  of  odds  ratio  of  an  observation  with  covariate  vector 
z,  relative  to  the  baseline.  The  PO  assumption 


Odds{G{t\P,z)} 

Odds{F(t)} 


0(/3,z). 


(21) 


where  Odds(a) 
where 


a/(  1  —  a),  yields  the  PO  model  of  the  form  G 


704) 


9(-)x 
1  —  0(-)x’ 


7  °  F, 

(22) 


and  a  =  1  —  a  for  any  a.  To  invert  the  transform  (22),  we  expand  it  in  a 
Taylor  power  series  about  x  —  0.  We  have 


k=  1 


We  note  that  the  coefficients  of  the  power  series  represent  geometric  probabil¬ 
ities,  given  #(•)  <  1,  and  therefore  7(x|-)  =  E(a;17),  where  U  is  geometrically 
distributed. 

Now,  let  us  derive  the  interpretation  of  the  PO  model  as  an  exponential 
frailty  model.  Consider  a  PO  model  of  the  form 


G(t\[3,z) 


0(P,z) 

9((3,z)  +  H(ty 


7(4) 


fl(-) 

d(-)  —  logo: 


(23) 


where  FI  is  some  nonparametrically  specified  baseline  cumulative  hazard.  As 
with  the  PO  model  (22),  for  any  two  values  of  the  predictor,  9i,  92,  and  the 
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corresponding  survival  functions  Gi(t)  =  G(t\6i),  i  =  1,2,  the  odds  ratio 
derived  from  (23) 

Odds{Gi(t)}  _  e1 
Odds{G2(t)}  “  F2 

is  a  constant  in  t,  and  the  PO  assumption  is  satisfied.  For  the  PO  model  in 
the  form  (23),  we  have 


Ce{H)=  7(e-*|-) 


go 

0(-)  +  H- 


This  is  the  Laplace  transform  of  an  exponential  distribution  with  parameter 
<?(•).  Therefore  in  this  case  U  follows  an  exponential  regression  model  with 
E  U  =  0({3,z)-\ 

While  it  is  generally  believed  that  frailty  distributions  are  identifiable 
[Ebbers  and  Ridder,  1982],  the  observation  that  the  two  different  frailty  dis¬ 
tributions  lead  to  the  same  PO  model  illustrates  an  important  point  of  non- 
identihability  of  frailty  distributions  in  semiparametric  frailty  models.  In 
fact,  /3s  representing  the  log-odds  ratios  in  both  models  (23)  and  (22)  are 
exactly  the  same.  This  non-identihability  is  due  to  the  fact  that  a  semipara¬ 
metric  model  is  defined  as  a  class  when  we  say  that  the  baseline  survival 
function  is  arbitrary.  This  class  can  be  represented  in  an  infinite  number  of 
equivalent  forms.  The  models  (23)  and  (22)  use  two  different  forms  of  an 
arbitrary  baseline  survival  function.  Indeed,  if  F(t)  is  an  arbitrary  survival 
function,  so  is  (1  —  log  F(f))W  Obviously  the  choice  of  this  form  along  with 
the  parameterization  of  7  determines  the  distribution  of  the  frailty  variable 
as  the  inverse  transform.  We  will  return  to  this  discussion  in  Section  7. 

In  order  to  specify  the  estimation  algorithm,  we  need  to  derive  the  pos¬ 
terior  risk  function  for  the  model.  Using  (3)  and  (23),  we  get 


Q(x  |  •,  c) 


c+  1 

9(-)  —  logo;’ 


(24) 


6.1.2  PH  models 

The  traditional  PH  model,  later  referred  to  as  the  Proper  PH  model, 

G(t  |  /3,  z)  =  F(t)e^z)  (25) 

does  not  need  an  introduction.  Direct  use  of  (3)  leads  us  to  the  posterior  risk 
function  of  the  form 

e*(a;|-,c)  =  0(-).  (26) 
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Note  that  since  0  does  not  depend  on  the  infinite  dimensional  part  of  the 
model  F,  the  QEM  procedure  and  the  self-consistency  equation  (2)  reduce  to 
the  Nelson- Aalen-Breslow  estimator  for  the  baseline  hazard  in  the  PH  model. 

Along  with  (25),  we  consider  a  PH  model  with  cure  (Improper  PH  model). 
While  it  is  sometimes  believed  that  the  chance  of  cure  defies  proportional 
hazards  and  implies  a  mixture  model  of  the  form  G  =  p(/3,  z )  +  p(/3,  z)F, 
where  p  is  the  probability  of  cure  [Kuk  and  Chen,  1992],  this  is  not  the 
case.  The  PH  model  with  cure  was  motivated  by  the  intention  to  combine 
proportional  hazards  and  an  improper  survival  function  [Tsodikov,  1998].  An 
improper  survival  function  G(t)  implies  that  the  cumulative  hazard  has  an 
asymptote,  9,  as  t  — >  oo.  Any  such  hazard  can  be  represented  as  9(1  —  F(t )), 
where  F(t)  is  a  proper  survival  function.  Introducing  covariates  into  the 
parameter  6  =  Q(/3,  z )  leads  to  a  PH  model  with  an  asymptote 

G(t\(3,z)  =  exp{-9({3,z)[l  -  F(t)}}  .  (27) 

Expanding  the  NTM  generating  function  of  the  Improper  PH  model 

j(x)  =  exp{0(l  —  x)}  (28) 

in  a  Taylor  power  series  about  x  =  0,  we  obtain  a  power  series  with  Poisson 
probabilities  with  parameter  9.  Therefore,  (27)  is  a  Poisson  frailty  model. 
Again,  we  note  a  non-identifiability  issue  as  non-random  frailty  and  Poisson 
frailty  lead  to  the  PH  model  with  the  same  hazard  ratios. 

Using  (3)  we  get  the  posterior  risk  function 

©6>(x|  -,c)  —  c  +  9(-)x.  (29) 

6.1.3  Linear  Transformation  Models 

The  PO  and  the  PH  model  considered  above  are  members  of  the  so-called 
linear  transformation  model  (LTM)  family  defined  as  [Cheng  et  ah,  1995, 
1997] 

log  v(T\z)  —  —  log  9(z)  +  e,  (30) 

where  T  is  the  failure  time,  e  is  the  random  error  with  the  distribution  fi, 
and  v  is  some  unspecified  strictly  increasing  function.  For  the  exponential 
predictor  9(/3,z )  =  exp(/3Tz),  the  model  assumes  a  linear  form  in  covariates 
and  transformed  response.  The  connection  between  LTM,  the  PH  model, 
the  PO  model,  and  binary  regression  models  was  discussed  in  [Doksum  and 
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Gasko,  1990].  After  a  little  algebra,  a  linear  transformation  model  can  be 
represented  as  an  NTM  with  the  NTM-generating  function 

l(x\f3,z)=p{\ogd{f3,z)  +  q(x)},  (31) 

where  p  is  a  parametrically  specified  tail  function  (— oo,  oo)  — >  [0,1],  (=1- 
distribution  function),  and  q  is  an  inverse  tail  function.  It  is  convenient  to 
specify  q  as  the  inverse  of  p,  then  9  =  1  corresponds  to  the  baseline  y(a;|-)  =  x. 

6.2  Compound  models 

While  the  models  considered  below  were  introduced  on  a  case  by  case  basis 
and  motivated  by  various  non-systematic  considerations,  in  this  paper  we 
re-invent  them  using  composition  as  a  tool  to  illustrate  the  method. 

6.2.1  PHPH  Cure  Model 

This  model  extends  the  Improper  PH  model  (27)  by  introducing  a  PH  short¬ 
term  effect  on  the  normalized  baseline  cumulative  hazard  F  — >  F'n^,z\ 

G(t  |  0,  z)  =  exp  {-SQ3,  z)[l  -  }  .  (32) 

Here  we  note  that  the  model  is  constructed  by  composition  (19)  of  NTM 
generating  functions  for  the  Improper  PH  model  (28)  and  the  Proper  PH 
model  7 v(x)  =  x7\ 

=  ^eix  |-)  °  jv(x  |-)  = 

[exp{— d(-)(l  —  x)}]  o  =  exp  {— 0(-)  (l  —  x }  .  (33) 

A  review  and  history  of  this  model  is  presented  in  [Tsodikov  et  al.,  2003]. 
Note  that  based  on  Section  6.1.2,  70  is  a  p.g.f.  of  a  Poisson  random  vari¬ 
able,  and  7 v  is  a  p.g.f.  of  a  nonrandom  variable.  Therefore  the  composition  is 
a  particular  case  of  Aalen’s  device  [Aalen,  1992]  (9)  with  v  being  Poisson(d), 
and  £  =  rj  being  nonrandom. 

Rather  than  compute  the  conditional  expectation  of  the  frailty  variable 
using  an  integral  over  the  compound  distribution,  we  can  use  the  chain  rule 
(20)  with  the  submodel-Qs  specified  by  (26)  and  (29)  and  immediately  get 

0(a;|-,  c)  =  9(-)rr](-)xr,^'>  +  cq(-).  (34) 
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6.2.2  T-frailty  model 


Now,  consider  a  model  composed  of  the  PH  and  the  PO  models. 

The  T-frailty  model  can  be  built  as  a  composition  of  the  NTM-generating 
functions  corresponding  to  the  PO  (??)  and  the  proper  PH  models  (25).  As 
a  result  of  the  composition  7  =  70  o  7^,  we  have 


G{*|  *(■), »?(■)} 


0(0 

0(0  +  H(t) 


(35) 


Indeed, 


70,i 


1 1  - 

r  0(0  1 

1')  - 

1.0(0 +*J 

v(-) 


is  the  Laplace  transform  of  a  T-distribution  with  scale  parameter  6  and  shape 
parameter  77,  and  we  have  the  interpretation  of  the  compound  model  (35)  as 
a  T-frailty  model. 

Note  that  since  an  exponentially  distributed  random  variable  correspond¬ 
ing  to  70  is  a  continuous  one,  the  above  composition  is  not  a  particular  case 
of  (9). 

The  compound  0  is  derived  from  the  chain  rule  (20) 


0(x  I  •,  c) 


V(-)  +  c 
d(-)  —  logo: 


(36) 


It  is  assumed  that  predictors  depend  on  (3,  z  via  the  form  p0  +  /3Tz,  where 
Po  stands  for  the  intercept  term  of  the  predictor.  Also,  different  predictors 
have  independent  sets  of  regression  coefficients  6  =  9(pgo  +  f3gz),  77  =  r}(Pn 0  + 
(3 f  z).  To  avoid  overparameterization  of  the  T-frailty  model,  the  intercepts 
are  hxed  at  zero.  Based  on  the  submodels,  /3e,  have  the  meaning  of  the 
log-odds-  and  log-hazards-ratio,  respectively. 

It  should  be  noted  that  the  direct  formulation  (35)  of  the  T-frailty  re¬ 
gression  model  by  making  shape  and  scale  parameter  into  linear  predictors 
offers  certain  advantages  as  compared  to  the  traditional  parameterization 
through  the  variance  v  of  the  frailty  variable  and  restricted  equal  shape  and 
scale  parameters.  With  the  traditional  model  v  =  0  corresponding  to  the 
PH  model  is  at  the  border  of  the  parametric  space  v  >  0.  The  traditional  T 
frailty  model  is  irregular  for  this  reason.  The  model  (35),  on  the  contrary,  is 
a  regular  one  for  any  (3.  We  will  confirm  this  observation  in  the  simulation 
study  by  showing  the  validity  of  standard  MLE  theory  for  (3  with  the  model 
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(35)  in  Section  8.2.  For  the  same  reason  we  prefer  (35)  to  the  Dabrowska 
and  Doksum  model  [Dabrowska  and  Doksum,  1988]  that  can  be  represented 
through  a  composition  7  =  7i/a  0  7e»  0  7a,  where  70  is  an  NTM-generating 
function  for  the  PO  model  in  the  form  (22),  7a  and  71  /a  correspond  to  the 
PH  model,  and  a  is  a  scalar,  independent  of  covariates 


7(x|  •) 


9{-)xa  1  ° 

l  —  9(-)xaj 


a  >  0. 


(37) 


The  above  model  becomes  the  PO  model  in  the  form  (22)  if  a  =  1,  and  it 
becomes  the  PH  model  in  the  limit  as  a  — >  0.  With  the  above  model,  the 
PH  assumption  corresponds  to  the  border  of  the  parametric  space  (a  =  0). 


7  Identifiability 

Given  the  combined  model  parameter  u  =  (/ 3,F ),  a  naive  definition  of  iden¬ 
tifiability  would  be 

G(t  \  u\ ,  z)  =t:Z  G(t  |  U>2 ,  z)  =>  LUi  =  U>2-  (38) 

However,  for  semiparametric  models,  (38)  does  not  hold.  Even  within  the 
NTM  class,  presentation  of  a  semiparametric  model  in  terms  of  an  NTM- 
generating  function  7  is  not  unique,  as  there  is  a  number  of  ways  to  represent 
an  arbitrary  monotonic  function.  Indeed,  expression  (31)  suggests  that  a 
transformation  p{q(F)},  where  p  ia  a  tail  function,  q  is  an  inverse  tail  func¬ 
tion,  and  F  is  an  arbitrary  survival  function,  is  again  an  arbitrary  survival 
function.  In  other  words,  for  any  model  7,  the  family  of  NTM  generating 
functions 

7(^1  •)  =  in°poq){x\  •)  (39) 

represents  the  same  semiparametric  model  for  any  p  and  q  as  defined  above. 

As  an  example,  let  us  represent  the  two  forms  (22)  and  (23)  as  members 
of  the  family  (39).  Specifically,  using  (39),  let  p  correspond  to  the  logistic 
distribution,  and  q  to  the  smallest  extreme  value  distribution 

p[x )  =  (1  +  e*)”1,  ] 9_1(a;)  =  log{Odds(a;)}, 

q~1{x)  =  exp{— exp(x)},  q{x)  =  log(— log(a;)). 
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Then,  using  the  NTM-generating  function  (22)  in  conjunction  with  (39)  and 
the  functions  p  and  q  as  defined  above,  we  obtain 


7(rl  •) 


*(•) 

0(-)  -  log(ar)’ 


(40) 


which  is  the  NTM-generating  function  corresponding  to  the  PO  model  in 
the  form  of  exponential  frailty  model  (23). 

As  a  consequence  of  the  above  observations,  frailty  distributions  are  not 
identifiable  unless  the  model  is  restricted.  Such  a  restriction  is  provided,  for 
example,  by  assuming  G  to  be  in  the  NTM  class,  and  fixing  the  parametric 
form  of  the  model-generating  function  7.  Then,  if  7  is  an  absolutely  contin¬ 
uous  distribution  function  on  [0,1],  then  7  is  a  strictly  increasing  function, 
and  F  is  identifiable,  given  (3,  as  F  =  7 -1(G  [•). 

There  is  no  universal  chain  rule  as  far  as  the  “parametric”  identifiability 
with  respect  to  / 3  is  concerned.  For  example,  while  a  composition  of  (Im¬ 
proper  PH) o (Proper  PH)  models  is  identifiable  as  a  PHPH  model  (Section 
6.2.1),  the  reversed  order  (Proper  PH)o(Improper  PH)  leads  to  an  nonidenti- 
fiable  Improper  PH  model  of  the  form  7(0;  [•)  =  exp{— 6rj(l  —  F1)}.  Additional 
restrictions  on  the  parametric  part  of  the  model  may  be  necessary  to  ensure 
identifiability.  For  example,  intercept  term  in  the  proper  models  is  restricted 
to  zero  to  eliminate  its  interaction  with  the  unrestricted  nonparametric  base¬ 
line  survival  function  F .  For  cure  models  built  by  composition  of  a  non-cure 
and  cure  model-generating  functions,  F  is  restricted  to  be  zero  at  the  last 
failure  [Taylor,  1995,  Tsodikov,  2002],  and  the  intercept  term  is  removed  from 
the  non-cure  submodel  predictor.  The  intercept  parameter  in  the  predictor 
of  the  cure  submodel  codes  for  the  baseline  cure  rate. 

Identifiability  of  the  PH  frailty  model  received  much  attention  in  econo¬ 
metric  literature  [Ebbers  and  Ridder,  1982,  Heckman  and  Singer,  1984,  Heck¬ 
man,  1991],  primarily  in  the  parametric  setting.  For  semiparametric  models, 
these  results  are  valid  under  similar  restrictions,  and  do  not  hold  in  the  (38) 
form.  Same  observation  applies  to  the  perceived  identifiability  of  shared 
frailty  distribution  from  marginals  in  the  bivariate  case.  Nonidentifiablity 
of  frailties  is  yet  another  argument  to  consider  modelling  on  the  frailty-free 
NTM  level. 

The  nonidentifiability  aspect  discussed  in  this  section  could  be  used  to 
our  advantage.  The  functions  7,  p  and  q  could  be  optimized  within  the 
family  (39)  to  maximize  the  speed  of  convergence  of  QEM  or  to  ensure  its 
applicability.  For  example,  QEM  is  not  applicable  to  the  PO  model  in  the 
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form  (22)  when  9  >  1  because  the  assumption  of  nondecreasing  posterior  risk 
function  does  not  hold.  At  the  same  time,  QEM  is  applicable  to  (23)  for  any 
9. 


8  Data  analysis 

8.1  Real  data  example 

As  an  example,  we  use  data  from  the  National  Cancer  Institutes  Surveillance 
Epidemiology  and  End  Results  (SEER)  program.  Using  the  publicly  avail¬ 
able  SEER  database,  39393  cases  of  primary  prostate  cancer  diagnosed  in 
Greater  San  Francisco  between  1973  and  2000  were  identified.  Prostate  can¬ 
cer  specific  survival  was  analyzed  by  stage  of  the  disease  (localized/regional, 
35230  patients,  vs.  distant,  4163  patients).  For  the  definition  of  stages  as  well 
as  for  other  details  of  the  data  we  refer  the  reader  to  SEER  documentation 
http: //seer,  cancer,  gov/. 

Two  basic  models  PH  (25)  and  PO  (23),  and  two  hierarchical  compound 
models  produced  by  compositions  of  PH  and  PO  model  generating  functions, 
T-frailty  model  (35),  and  the  PHPH  cure  model  (32),  were  applied  to  fit  the 
data.  Stage  of  the  disease  was  represented  through  two  indicator  dummy 
variables  combined  into  a  vector  z.  Local/Regional  stage  was  considered 
as  a  baseline  group  and  the  corresponding  regression  coefficient  restricted 
to  0  for  identihability.  Regression  coefficient  (3  for  the  distant  stage  codes 
for  the  difference  in  survival  between  the  two  stages  expressed  either  as  a  log 
hazards  or  log  odds  ratio,  dependent  on  the  type  of  model  generating  function 
where  it  is  used.  The  basic  models  have  one  predictor  9((3,z )  =  exp (/3z), 
where  z=Indicator( “Distant  stage”).  Compound  models  have  two  predictors, 
9(/3g,  z)  =  exp(/3gz)  and  ?7(/3r7,  z)  =  exp (P^z)  coding  two  hazard  ratios,  long¬ 
term  effect  and  short-term  effect,  respectively,  in  the  PHPH  cure  model, 
and  odds  ( 9 )  and  hazard  (rj)  ratios  in  the  T-frailty  model.  In  the  latter 
model,  odds  and  hazard  ratio  predictors  have  the  interpretation  of  the  scale 
and  shape  parameter  of  the  frailty  distribution,  respectively.  Regression 
coefficients  in  the  PH  model  (/ 3g )  and  the  PH  submodels  of  the  PHPH  (/3g,  [3V) 
and  T-frailty  models  (/ 3V )  measure  the  disadvantage  of  being  in  the  distant 
stage  relative  to  local/regional  stage  as  a  relative  risk.  Regression  coefficient 
in  the  PO  model  (/ 3g ),  and  the  one  in  the  PO  submodel  of  the  T-frailty  models 
measure  the  difference  from  an  opposite  point  of  relative  odds  of  survival. 
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Since  risk  and  odds  of  survival  are  opposites  (high  risk  is  bad,  high  survival 
is  good),  these  coefficients  are  expected  to  be  of  opposite  signs  for  in  the  PO 
and  the  PH  model  fitted  to  the  same  data. 

Observed  (Kaplan-Meier)  and  expected  model-based  estimates  of  the 
survival  functions  by  group  are  shown  in  Figure  2. 

Parameter  estimates  and  confidence  intervals  are  shown  in  Table  1. 


Model 

Parameter 

Point- 

estimate 

Confidence 

interval 

p- Value 

PH 

/ % 

2.380 

(2.328,2.432) 

<0.001 

PO 

Pe 

-3.086 

(-3.162,-3.011) 

<0.001 

PHPH 

Improper  PH:  f3g 
Proper  PH:  f3v 

1.065 

1.788 

(0.923,1.207) 

(1.620,1.956) 

<0.001 

<0.001 

T-frailty 

PO:  p0 

PH:  A, 

-3.369 

-0.179 

(-3.580,-3.158) 

(-0.301,-0.057) 

<0.001 

<0.001 

Table  1:  Parameter  estimation  and  hypothesis  testing  for  prostate  cancer 
data  based  on  PH,  PO,  PHPH  and  T-frailty  models.  Negative  f3  in  the  PO 
effect  and  positive  f3  in  the  PH  effect  correspond  to  worse  survival  and  vise 
versa. 

Confidence  intervals  and  hypotheses  testing  is  based  on  the  inverse  of  the 
observed  profile  information  matrix 

d  %.(/3) 
pr  d/3d/3T  ’ 

where  the  profile  likelihood  £pr  is  given  by  (5).  Outlined  in  the  Appendix  A. 4 
are  the  main  results  that  lead  to  exact  computation  of  Ipr,  [Tsodikov  and 
Garibotti,  2005]. 
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Survival  n  Survival 


Proportional  hazards  model 


Proportional  odds  model 


Figure  2:  Prostate  cancer  cause-specific  survival  by  stage.  Observed  (Kaplan- 
Meier)  and  expected  survival  curves  for  four  models. 
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Profile  Likelihood 


From  Figure  2  it  is  evident  that  P-frailty  model  provides  the  best  fit  to 
the  data.  The  PO  model  is  second  best.  Given  the  hierarchical  structure  of 
T-frailty  model,  its  goodness  of  fit  can  be  tested  vs.  the  PO  model.  This  is 
a  test  for  (3V  =  0  in  T-frailty  model,  and  it  results  in  a  significant  difference 
xf  =  7.50,  p  =  0.006.  The  deviance  with  all  other  models  exceeds  60,  and 
we  focus  on  the  T-frailty  model  as  the  best  choice  at  the  level  of  model 
complexity  considered  so  far.  We  could  have  tried  to  improve  on  the  fit  by 
using  compositions  of  three  or  more  submodels,  but  felt  that  the  improvement 
over  the  T  frailty  model  would  be  irrelevant  for  our  data.  All  models  indicate 
a  highly  significant  effect  of  stage  (p  <  0.0001),  which  is  a  trivial  conclusion 
in  this  case. 

The  validity  of  standard  maximum  likelihood  theory  as  applied  to  the  T- 
frailty  model  (35)  will  be  studied  by  simulations  in  the  next  section.  As  the 
first  observation,  in  Figure  3  we  show  that  the  form  of  the  profile  likelihood 
£w  in  regression  coefficients  /3V  (log  hazards  ratio)  and  (3g  (log  odds  ratio)  is 
remarkably  quadratic.  In  the  next  section  we  will  verify  by  simulations  that 
the  curvature  of  the  profile  likelihood  surface  leads  to  consistent  estimates  of 
the  standard  errors  of  (3. 


A 

-39810 

"O 

1  -39820 

A 

/  \ 

;§  -39830 

/  \ 

/  \ 

CD 

1  -39840 

1  \ 

CL 

-39850 

\ 

-0.6  -0.4  -0.2  0.0  0.2 

Log  Hazard  Ratio 


-4.0  -3.5  -3.0  -2.5 

Log  Odds  Ratio 


Figure  3:  Profile  likelihood  as  a  function  of  regression  coefficients  sampled 
around  the  MLE  point. 
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8.2  Simulations 

We  begin  by  fitting  a  parametric  T -frailty  model  (35)  to  the  prostate  cancer 
data  with  the  baseline  survival  function  specified  as  Weibull  distribution.  The 
fit  (not  shown)  is  very  similar  to  the  semiparametric  version  of  the  model, 
and  the  parameter  estimates  are  as  follows,  f3g  =  —3.454,  /3V  =  —0.215,  and 
[median  of  F] =265.571,  [shape  of  F] =1.491. 

Each  simulation  experiment  was  replicated  1000  times.  Four  sets  of  exper¬ 
iments  were  generated  with  samples  sizes  of  100  to  1000.  Shown  in  Figure  4 
are  normal  probability  plots  for  the  components  of  (3  —  (/3g,  (3V)T.  As  evident 
from  the  figure,  small  sample  size  may  be  associated  with  some  departure 
from  normality  of  MLEs,  however,  with  a  sample  size  larger  than  300  the  esti¬ 
mates  look  perfectly  normal.  Shown  in  Table  2  are  the  results  of  simulations 
evaluating  bias  and  variance  of  the  estimates.  Empirical  means  of  (3  show 
good  correspondence  to  the  true  parameter  values  used  to  simulate  the  data 
and  are  within  the  margin  of  error  expected  from  1000  replicates.  Empir¬ 
ical  standard  errors  Sn{/3}  estimated  from  replicated  regression  coefficients 
are  in  excellent  correspondence  with  the  En{erg},  the  empirical  mean  of  the 
replicated  Jpr-based  estimate  of  standard  errors.  The  precision  of  variance 
estimation  improves  rapidly  with  the  sample  size. 

9  Discussion 

Recent  years  have  seen  an  explosion  of  new  survival  models  and  model- 
specific  estimation  procedures.  Mostly,  new  models  are  formulated  on  an 
ad-hoc  basis  and  not  much  methodology  is  available  to  guide  us  on  the  model 
choice  and  appropriate  estimation  procedures.  Although  challenges  of  mod¬ 
ern  survival  analysis  will  likely  keep  the  business  of  model  choice,  estimation, 
identifiability  and  asymptotics  largely  on  a  case-by-case  basis  for  some  time, 
there  is  a  continued  effort  to  automate  the  process.  This  paper  presents  yet 
another  step  towards  the  goal  by  recognizing  a  mechanism  of  how  models  can 
be  cloned  while  ensuring  that  the  descendants  are  served  computationally  by 
a  common  Nonparametric  Maximum  Likelihood  Estimation  framework.  We 
used  frailty  models  and  some  associated  compounding  techniques  as  a  moti¬ 
vation  for  the  method  of  this  paper  and  offered  a  way  to  build  hierarchical 
families  of  semiparametric  models  that  can  be  used  to  reproduce  complex 
patterns  of  covariate  effects  using  more  than  one  predictor  and  to  test  model 
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Expected  Normal  Expected  Normal  Expected  Normal  Expected  Normal 


Parameter 

E  n0} 

s  n{p} 

E„{crfl} 

Sn{dg} 

Sample 

size 

PH:  fig 
PO:  Pr, 

-3.168 

0.117 

1.394 

0.725 

1.451 

0.700 

0.415 

0.481 

100 

PH:  pg 
PO:  A, 

-3.478 

-0.113 

0.741 

0.308 

0.744 

0.304 

0.072 

0.058 

300 

PH:  pe 
PO:  Pv 

-3.352 

-0.159 

0.535 

0.220 

0.553 

0.228 

0.034 

0.026 

500 

PH:  pe 
PO:  Pv 

-3.433 

-0.197 

0.392 

0.158 

0.391 

0.157 

0.018 

0.012 

1000 

Table  2:  The  results  of  computer  simulation  to  verify  asymptotic  properties 
of  profile  likelihood  based  MLEs. 
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assumptions. 

Oftentimes  when  we  discover  that  one  of  our  standard  models  is  wrong 
for  the  data,  heterogeneity  in  the  form  of  a  random  effect  is  introduced 
to  model  the  departure  from  the  basic  model.  This  strategy  is  associated 
with  building  models  on  the  level  of  missing  data  and  it  requires  a  large 
amount  of  analytical  work  to  specify  the  algorithms.  If  our  goal  is  to  come 
up  with  a  suitable  model  for  the  data  rather  than  to  build  the  model  on 
mechanistic  premises,  the  heterogeneity  instrument  is  neither  convenient  nor 
necessary.  Non-frailty  framework  such  as  NTM-QEM  discussed  in  this  paper 
offers  streamlined  model  building  and  specification  of  inference  algorithms 
with  minimal  analytic  effort. 

While  rigorous  evidence  of  identihability,  consistency  and  efficiency  is 
still  spotty,  we  will  have  to  resort  to  simulations  in  studying  the  asymptotic 
properties  of  estimation  procedures.  Construction  of  a  general  computational 
and  model  building  framework  could  stimulate  targeted  efforts  to  develop 
rigorous  asymptotic  theory  for  certain  classes  of  models. 

The  non-identihability  aspect  discussed  in  Section  7  is  quite  intriguing. 
A  transformation  (39)  as  well  as  other  alternative  parameterizations  of  the 
model  generating  function  do  not  amount  to  a  re-parameterization  of  the 
model  and  invariant  MLEs.  While  regularity,  asymptotics,  convergence  and 
other  properties  of  inference  procedures  change  under  such  transformations, 
the  semiparametric  model  stays  the  same.  This  poses  an  interesting  question 
of  optimizing  inference  procedures  over  the  functional  class  of  mo  del- invariant 
transformations. 

Interesting  issues  for  future  research  include  extensions  of  the  principles 
presented  in  this  paper  to  multivariate  survival  models  and  time-dependent 
covariates. 
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A  Appendix 

A.l  Proof  of  Proposition  4.1.  Properties  of  the  surro¬ 
gate  of  posterior  risk. 

Observe  that 


••())  =  —  [0(x|-,l)  -  0(x|-,O)]  >  0, 

as  O  is  a  non-decreasing  function  of  x.  Also,  since  7  is  strictly  increasing, 

e(x|-,Q)  =  aiogJ(l|')>Q. 

OX 

The  above  two  expressions  imply 

0(x  | -,  1)  >  G(x  |-,  0). 

This  proves  (15).  Now,  observe  that  for  PH  mixture  models,  0(cc|-,c)  is  the 
conditional  expectation  of  frailty  random  variable  U,  given  observed  event. 
This  interpretation  of  (15)  gives  (16),  and 

E  (C7  |  0, -,  0)}  =  Y(1 1  -)- 

Since  for  PH  mixture  models  7  is  a  p.g.f., 

E{C/|.}=V(1|-). 

Combining  the  above  two  expressions  gives  (17). 

A. 2  Proof  of  Proposition  5.1  The  class  of  mixture  mod¬ 
els  is  closed  with  respect  to  composition 

By  the  Bernstein  theorem  (Feller  [1971]),  we  need  to  prove  that  7(e_s|-)  = 
(70  °7»?)(e_s|')  is  a  completely  monotonic  function.  Let  ^.(s)  =  7. (e— s | -) .  We 
have  'ijj(s)  =  ^  {— log^^s)}.  For  any  functions  £  and  £,  the  composition 
£  o  £  is  completely  monotonic  if  £  is  completely  monotonic,  £  >  0,  and  £'  is 
completely  monotonic.  Applied  to  the  functions  £>,  this  means  that  we  have 
to  prove  that  for  any  completely  monotonic  function  £>(s)  >  0,  the  function 
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f(s )  =  {—  log^(s)}'  is  completely  monotonic.  It  can  be  proved  by  induction 
that 

(-1  )n/ln)M  = 

where  a01  =  1,  an+i,i  =  a„i,  a„+i)fc  =  a„fc(n-/i:  +  2)  +  ariifc_i,  k  =  2,..,  ,n  +  l, 
an+i!n+2  =  an,n+i)  n  =  0, 1 ,  -  •  • .  From  the  above  equations  it  follows  that 
anfc  >  0  for  any  n,k.  Also,  i^(s)  >  0,  s  >  0,  and  since  ^  is  completely 
monotonic,  (— l)kip(k\s)  >  0.  Therefore,  (— l)n/^(s)  >  0,  s  >  0.  End  of 
proof. 


A. 3  Proof  of  Proposition  5.2.  Composition  chain  rule. 

Proof  of  first  statement  is  a  straightforward  exercise  in  differentiation  of  com¬ 
pound  functions  entering  (3).  Validity  of  second  statement  follows  from  (20) 
upon  observation  that  all  components  of  (20)  are  nondecreasing  functions  in 
x.  End  of  proof. 


A. 4  Profile  information  matrix 

Let  the  vector  h  represent  a  set  of  jumps  of  the  baseline  cumulative  hazard  H . 
Implicit  differentiation  of  the  profile  likelihood  yields  the  following  expression 
for  the  profile  information  matrix 


Ipr  —  Ipp  +  hTpIhhhf3  +  tiplhp 


+  Ihphp, 


(41) 


where 


f  dh 
=  80 


P=P 


and 


dH(P,  h) 
dadbT 


with  a  and  b  equal  to  /3  or  h. 

Notice  that  Ipr  has  dimension  d  x  d,  d  —  dim(/3).  Therefore  only  a  small 
matrix  needs  to  be  inverted  in  order  to  get  an  estimator  of  the  covariance 
matrix  of  regression  coefficients. 

The  difficulty  in  (41)  is  that  since  h(/3)  is  defined  implicitly,  so  is  the  po¬ 
tentially  large  Jacobian  matrix  dh/d/3.  Therefore,  the  Jacobian  is  generally 
unavailable  in  a  closed  form  and  its  computation  is  the  crux  of  the  matter.  It 
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can  be  shown  [Tsodikov  and  Garibotti,  2005]  that  dh/d/3  satisfies  a  system 
of  linear  equations  with  a  special  structure. 


<B+D>lr‘w- 

where  D  be  the  diagonal  matrix  with  elements 


7  _  Dm 

UJm. 


{fir, 


,  m  —  1, . . . ,  d, 


R  =  (. Rml )  With  Rml  =  X"=max{mi}  «*»  where 


a*  =  ^  Q(.Fj|/3,Zjj,Cjj),  i  =  l,...,n, 

ieCiUDi 


Cj  and  T>i  is  a  set  of  censored  observations  and  failures  at  the  ith  time  point, 
respectively, 

Q(x  |  -,c)  =  =  ~{Q{X  |  •,  c)  -  c)(0(x  |  •,  c  +  1)  -  0(x  I  •,  c)), 

and  for  k  —  1, . . . ,  d, 


6(fc) 


E 

(«)eKi 


*90 (Fj  |  (3,  Zjj,  dj) 
d  (3k 


sr'  dO{Fi  |  (5,  Cjj )  | 


For  each  k  =  1 , ,n  the  vector  dh/ (5k  can  be  obtained  from  the  following 
Proposition  [Tsodikov  and  Garibotti,  2005]. 


Proposition  A.l  Let  D  be  an  n  x  n  diagonal  matrix  with  diagonal  ele¬ 
ments  di  ^  0,  i  =  l,...,n.  Let  R  =  ( Rki )  be  an  n  x  n  matrix,  with 
Rki  =  XjLma  x{ki}ai>  where  Oi,  i  =  1 ,  ...n  are  real  numbers.  Let  b  be  an 
n-dimensional  vector. 

Define  the  functions  <fk  ■  1R  — >  IP,  k  =  1, ...  ,  n  recursively  as 


Vn{y) 

Vk{y) 


bn  CLn 

- r-  -  j-y, 
tin  dn 

l 

dk 


n  n  l—l 

bk  ^  E aiV  +  E  E a^i{y) )  > 

i=k  l=k+ 1  i=k 


k  =  n  —  1, . . . ,  1, 
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for  y  in  M.  Let  (p  :  M 
let 


Then  the  solution 
dimensional  vector  x  - 


->•  M  be  the  function  given  by  <p(y)  =  Ylk=i  Pk(y)  and 

y=  _ M _ , 

1  +  ^(0)  -£(i) 

to  the  system  of  equations  ( D  +  R)x  =  b  is  the  n- 
■  (<Pi  (y),---,Tn(y))T- 
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Abstract 


For  semiparametric  models,  interval  estimation  and  hypotheses  testing  based  on  the  infor¬ 
mation  matrix  for  the  full  model  is  a  challenge  because  of  potentially  unlimited  dimension. 
Use  of  the  profile  information  matrix  for  a  small  set  of  parameters  of  interest  is  an  appealing 
alternative.  Existing  approaches  for  the  estimation  of  the  profile  information  matrix  are  either 
subject  to  the  curse  of  dimensionality,  or  are  ad-hoc  and  approximate  and  can  be  unstable  and 
numerically  inefficient.  We  propose  a  numerically  stable  and  efficient  algorithm  that  delivers 
exact  observed  profile  information  matrix  for  regression  coefficients  for  the  class  of  Nonlinear 
Transformation  Models  [Tsodikov,  2003] .  The  algorithm  deals  with  the  curse  of  dimensionality 
and  requires  neither  large  matrix  inverses  nor  explicit  expressions  for  the  profile  surface. 

Keywords:  profile  likelihood,  semiparametric  models,  information  matrix 
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1  Introduction 


In  semiparametric  models  the  parameter  is  partitioned  as  (/3,  H )  with  (5  a  low-dimensional 
parameter  of  interest  and  H  a  high-dimensional  nuisance  parameter.  For  example,  in  semi¬ 
parametric  regression  survival  models,  f3  is  the  vector  of  regression  coefficients  and  H  is 
the  baseline  cumulative  hazard  function  estimated  as  a  step-function  by  the  Nonparametric 
Maximum  Likelihood  Estimator  (NPMLE).  The  dimension  of  H  is  given  by  the  number  of 
distinct  failure  times  and  increases  with  the  sample  size. 

Within  the  NPMLE  framework  the  following  tools  are  available  for  interval  estimation 
and  hypotheses  testing  for  f3. 

1.  Likelihood  Ratio.  The  likelihood  ratio  statistic  for  testing  H0  :  (3  =  (3q  is  defined  as, 

LR(PO)  =  2(£0,H)-£(PO,H(PO))), 

where  i  is  the  log-likelihood  function,  0,H)  is  the  NPMLE  of  ((3,H),  and  H({3)  is  the 
MLE  of  H  given  /3.  Although  classical  ML  theory  does  not  directly  apply  to  unlimited 
dimension,  for  many  semiparametric  models  LR  has  an  asymptotic  chi-square  distribution 
with  d  degrees  of  freedom,  where  d  is  the  dimension  of  /3.  A  (1  —  a)%  confidence  set  for 
/ 3  is  given  by 

{P  ■  LR(/3)  <  CdJ  , 

where  Cd,a  is  the  a  percentile  of  the  chi-square  distribution  with  d  degrees  of  freedom. 
When  the  asymptotic  distribution  of  LR  is  unknown,  bootstrap  can  be  used  to  approximate 

Cd,a- 

The  likelihood  ratio  approach  for  building  confidence  regions  for  (3  involves  inverting  the 
LR  surface,  which  is  quite  computer  intensive  as  repeated  maximizations  of  the  likelihood 
with  respect  to  H  are  required. 


2.  Wald  Statistic.  An  alternative  method  of  inference  for  f3  is  based  on  the  Wald  statistic 
defined  as 

W(/3)  =  (/3-/3)TE^(/3-/3), 

where  E^  is  the  /3-submatrix  of  the  inverse  of  the  observed  information  matrix 


I 


/  d2e(/3,H) 

82e(0,H)  \ 

l  d/3d/3T 

dpdHT  \ 

\  d2£(0,H) 

d2l(P,H)  1 

\  dHd0T 

8HdHT  J 

Note  that  in  the  presence  of  nuisance  parameters  the  information  matrix  needs  to  be 
inverted  twice  [Severini,  2000],  p.  121,  the  first  time  in  its  high- dimensional  full  model 
form  /,  and  the  second  time  as  a  dim (/3) -submatrix  of  E  =  /-1. 
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Under  certain  conditions,  W  is  asymptotically  equivalent  to  the  likelihood  ratio  and  has 
asymptotically  a  chi-square  distribution  with  d  degrees  of  freedom.  In  this  case, 

{P  :  W(P)  <  Cd,a}  , 

is  a  conhdence  set  of  approximate  coverage  probability  1  —  a. 

The  bottleneck  of  this  procedure  is  the  invertion  of  a  potentially  infinitely  large  matrix  I. 

The  two  methods  of  inference  on  f3  described  above  are  based  on  the  full  model.  An 
appealing  alternative  is  to  consider  the  so-called  profile  likelihood  [Murphy  and  van  der 
Vaart,  2000] 

£pr({3)  =  ma x£(P,H). 

H 

The  profile  likelihood  may  be  used  as  a  likelihood  for  (3.  The  MLE  for  /3,  the  first  component 
of  the  pair  (/3,  77)  that  maximizes  7(/3,  77 ),  is  the  maximizer  of  the  profile  likelihood  function 

iprW). 

Theoretical  justification  for  the  use  of  the  profile  likelihood  for  semiparametric  models  was 
given  in  [Murphy  and  van  der  Vaart,  2000,  van  der  Vaart,  1998,  Murphy  and  van  der  Vaart, 
1997].  It  was  shown  that  profile  likelihoods  with  nuisance  parameter  estimated  out  behave 
like  ordinary  likelihoods  under  regularity  conditions.  These  conditions  need  to  be  verified 
on  a  case  by  case  basis  as  the  general  theory  remains  a  challenge.  Theoretical  justiheation 
has  been  obtained  for  the  proportional  odds  (PO)  model  [Murphy  and  van  der  Vaart,  2000, 
Murphy  et  al.,  1997]  and  the  PH  frailty  models  [Murphy,  1994,  1995,  Parner,  1998,  Kosorok 
et  ah,  2004], 

The  observed  profile  information  matrix  will  be  denoted  7pr, 

_  dHpr(P) 

Pr  dm T  0=g  ' 

This  matrix  is  asymptotically  same  as  Eh] ,  and  summarizes  partial  information  on  (3. 
The  Likelihood  Ratio  and  Wald  statistics  based  on  £pr  are  easier  to  obtain  than  the  ones 
based  on  the  full  model  provided 

•  a  numerically  efficient  method  is  available  to  profile  out  the  nuisance  parameter  77,  and 

•  it  is  possible  to  derive  the  exact  observed  profile  information  matrix  or  estimate  it  in  a 
computationally  efficient  way. 


5 


Fulfilling  both  conditions  is  a  challenge.  First,  maximization  over  H  is  a  problem  of  poten¬ 
tially  very  large  dimension.  Second,  in  most  cases  £pr  cannot  be  differentiated  analytically. 
Several  alternatives  for  estimating  the  profile  information  matrix  have  been  proposed  in  the 
literature.  However,  they  are  all  approximations,  often  difficult  to  calibrate  in  practice,  and 
algorithms  to  obtain  them  are  computationally  costly. 

In  this  paper  we  propose  a  computationally  efficient  exact  solution  for  the  class  of  semi- 
parametric  Nonlinear  Transformation  Models  (NTM)  [Tsodikov,  2003].  The  basic  assump¬ 
tion  that  defines  this  model  family  is  that  the  survival  function  at  each  timepoint  t  is  a 
function  of  H(t)  mapping  real  numbers  [0,  oo]  — >  [0, 1]  rather  than  a  functional  mapping 
a  functional  space  to  [0,1].  In  other  words,  model-based  survival  function  is  obtained  by 
plugging  a  cumulative  hazard  H  or  a  baseline  survival  function  F  =  exp (-H)  into  a  suit¬ 
ably  defined  parametric  function  (so-called  model-generating  function,  see  Section  2).  Note 
that  a  similar  assumption  underlies  the  von-Mises  Calculus  [van  der  Vaart,  1998],  p.291. 
The  NTM  class  includes  the  proportional  hazards  (PH)  model,  univariate  PH  frailty  mod¬ 
els  ],  the  proportional  odds  model,  cure  models  such  as  the  PHPH  model  [Tsodikov,  2002, 
Tsodikov  et  ah,  2003].  A  numerically  efficient  Quasi-EM  algorithm,  a  subset  of  the  MM 
family  Lange  et  al.  [2000]  was  developed  to  obtain  the  maximum  profile  likelihood  for  NTM 
models  [Tsodikov,  2003].  The  algorithm  has  since  been  used  in  computer  intensive  settings 
such  as  the  bootstrap  [Dixon  et  ah,  2005]. 

The  algorithm  for  the  exact  Ipr  proposed  in  this  paper  works  under  the  following  two 
basic  assumptions. 

Independence  of  the  future.  Independence  of  the  future  means  that  the  contribution  to 
the  likelihood  of  an  observed  event  at  time  t  depends  on  the  past  H[0,t\  of  the  function 
H ,  but  not  on  the  future. 

Nonlinear  Transformation  Model  Assumption.  The  survival  function  given  covariates  is 
specified  as  a  parametric  transformation  of  H .  A  detailed  definition  is  given  below. 

We  compare  our  method  to  the  following  three  existing  techniques  used  to  estimate  the 
profile  information  matrix  that  amount  to  particular  forms  of  numerical  differentiation  of 
the  second  order. 

1.  Discretized  second  derivative.  Corollary  3  of  [Murphy  and  van  der  Vaart,  2000]  shows  that 
under  certain  conditions 

q  log  £pr(/3  T"  hnVn )  log  £pr(/3)  p  T  T  ^ 

n%  * V  prV’  [  ) 

P  j  P 

for  all  sequences  vn  ->«er  and  hn  0  such  that  (^Nihn)^1  =  Op(l). 
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This  result  can  be  used  to  derive  an  estimate  of  Ipr.  Note  that  this  method  requires 
careful  maintenance  of  the  speed  of  convergence  of  the  sequence  {hn}  as  the  condition 
(\Jnhn)-1  =  Op(  1)  implies  that  the  convergence  should  be  neither  too  slow  nor  too  fast. 
The  reason  is  that  the  precision  of  discrete  differential  operator  as  n  — >  oo  in  the  left  side  of 
(1)  needs  to  be  measured  against  the  convergence  of  MLEs  to  the  true  value.  Indeed,  under 
regularity  conditions,  the  asymptotic  expansion  of  the  likelihood  ratio  statistic  about  the 
MLE  (3  has  the  form  n(/3  —  /3)T  Ipr(/3  —  /3)  +  op(y/n  \\(3  —  (3*\\  +  1),  where  (3*  is  the  true  value. 
The  procedure  (1)  is  designed  to  extract  the  quadratic  term  by  setting  (3  =  (3  +  hnvn,  and 
by  ensuring  the  1/a fn  rate  of  convergence  of  (3  to  simultaneously  (3  and  (3 *  so  that  the 
quadratic  term  is  indeed  the  dominant  one.  Otherwise,  the  expansion  would  be  dominated 
by  its  op(l)  part  if  hn  is  too  fast  or  by  op(y/n  || (3  —  (3*\\)  if  hn  is  too  slow. 

See  Section  4  for  further  details  and  implementation  of  this  method. 

2.  Fitting  a  Quadratic  Form.  Asymptotically,  under  regularity  conditions,  the  profile  like¬ 
lihood  surface  around  the  true  (3  is  quadratic.  Nielsen  et  ah  [1992]  proposed  fitting  a 
quadratic  form  to  £ pr(/3 )  in  some  domain  around  the  maximum  likelihood  estimator,  j3, 
and  to  derive  an  approximate  profile  information  matrix  using  the  estimated  coefficients 
of  the  form.  Note  that  globally  the  likelihood  surface  is  not  quadratic.  The  quadratic 
approach  is  difficult  to  implement  as  a  sufficiently  small  domain  around  f3  where  the 
likelihood  surface  can  be  well  approximated  by  a  quadratic  form  is  not  well  defined.  Mis- 
specihcation  of  this  domain  with  the  quadratic  method  often  leads  to  estimates  of  the 
profile  information  matrix  that  are  not  positive  definite,  particularly  if  the  number  of  co¬ 
variates  is  large.  Yet  the  domain  needs  to  be  large  enough  to  ensure  adequate  precision 
and  sufficient  sample  size  representing  the  number  of  likelihood  evaluations  within  the 
domain.  This  balancing  act  is  notoriously  difficult  as  the  true  variance  is  unknown  and 
the  likelihood  surface  is  specific  to  the  data  set  being  analyzed. 

3.  Numerical  Differentiation  of  the  Profile  Likelihood.  Standard  numerical  algorithms  can  be 
used  to  numerically  differentiate  the  profile  likelihood  function.  We  use  Ridder’s  method 
[Press  et  ah,  1994]  in  the  examples  presented  in  Section  4.  The  difficulties  in  the  implemen¬ 
tation  of  this  idea  are  similar  to  the  ones  with  the  Quadratic  Form  approach.  Numerical 
differentiation  requires  choosing  a  tolerance  for  the  estimation  of  the  derivatives,  and  typ¬ 
ically  involves  interpolation  of  the  function.  The  precision  and  speed  of  these  methods  are 
in  inverse  relashionship  and  they  vary  widely  dependent  on  the  tolerance.  Since  likelihood 
surface  is  dataset-specific,  this  method  may  require  calibration  and  tuning  for  a  particular 
dataset. 

Approximating  nature  of  the  standard  approaches  outlined  above,  the  need  to  balance 

various  tradeoffs  in  their  implementation,  and  a  likely  need  to  tweak  implementation  based  on 
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the  dataset  at  hand,  makes  it  difficult  to  develop  these  approaches  to  the  point  of  automation 
sufficient  for  use  in  standard  statistical  software. 

The  algorithm  proposed  in  this  paper  is  exact,  automatic  and  requires  no  tuning  or 
calibration.  This  makes  it  an  attractive  alternative,  particularly  with  statistical  software 
applications  in  mind. 

The  PO  model  is  used  in  this  paper  to  compare  via  simulations  the  performance  of 
the  three  estimation  methods  for  Ipr  and  the  proposed  exact  algorithm.  For  different  sample 
sizes,  the  approaches  were  compared  in  terms  of  the  number  of  operations  required  to  achieve 
a  reasonable  standardized  precision.  Naturally,  the  exact  method  outperforms  any  approx¬ 
imating  method  if  an  ever  better  precision  is  demanded.  In  our  numerical  study  we  focus 
on  practical  precisions  where  approximating  methods  could  nevertheless  represent  a  viable 
competition  to  an  exact  procedure.  Numerical  efficiency  and  precision  of  the  computation 
of  Ipr  is  of  great  importance  for  variable  selection  procedures.  In  an  example  involving  7 
variables,  backward  variable  selection  using  the  Wald  statistic  based  on  the  exact  profile 
information  matrix  took  less  than  one  third  of  the  time  of  the  quadratic  approach.  We  also 
compared  the  estimation  methods  in  terms  of  relative  error.  Of  all  approximating  methods, 
the  numerical  approach  has  the  smallest  relative  error. 

As  a  result  of  these  studies  we  believe  that  the  exact  method  should  be  the  primary 
choice  for  Nonlinear  Transformation  models. 


2  Nonlinear  Transformation  Models 

Nonlinear  transformation  models  (NTM)  are  defined  as  follows  [Tsodikov,  2002,  2003]. 

Definition  1  Let  j(x  |  /3,  z)  be  a  parametrically  specified  distribution  function  with  x-domain 
of  [0, 1] .  Let  F(t )  be  a  nonparametrically  specified  baseline  survival  function.  A  semipara- 
metric  regression  survival  model  is  called  a  Nonlinear  Transformation  Model  if,  conditional 
on  the  covariates  z,  its  survival  function  G  can  be  represented  in  the  form 

G(t  \P,z)  =  7(F(t)  \(3,z).  (2) 

The  function  7  is  called  the  NTM- generating  function. 

Note  that  F(t)  =  exp (—H(t))  where  H{t )  is  the  baseline  cumulative  hazard  function.  With 
this  in  mind  we  can  write  the  hazard  function  of  the  model  as 

(3) 


where  h(t)  =  H'(t )  is  the  baseline  hazard  function. 

In  [Tsodikov,  2003]  a  Quasi-EM  (QEM)  point  estimation  algorithm  for  the  NTM  was 
developed  and  conditions  that  ensure  its  convergence  were  given. 

The  algorithm  solves  a  functional  self-consistency  score  equation  of  the  form  H  =  ip ((3,  H) 
for  H,  where  ^  is  a  mapping  that  generalizes  a  Nelson- Aalen-Breslow  estimator  for  the 
proportional  hazards  model  so  that  its  denominator  depends  on  H  as  well  as  (3.  Functional 
iterations 

#(*+!)  =  ^(/3,if(*)),  k  =  l,2,...  (4) 

are  exercised  until  H,  the  fixed-point  of  if),  has  been  approximated,  H ^  — >•  H,  as  k  — >  oo, 
see  [Tsodikov,  2003]  for  details. 

Although  any  parameterization  of  7  in  terms  of  (3  and  z  is  allowed,  in  the  examples 
we  assume  that  7  is  parameterized  through  a  set  of  parameters/predictors  9,  77, . . .,  where 
each  predictor  is  further  parameterized  using  generally  different  sets  of  regression  coefficients 
Pi,  fo, . ,  so  that  6  =  exp(P^z),  rj  =  exp (/3Jz),  .... 

2.1  Profile  Likelihood  Approach 

Let  ti,  i  =  l,...,n  be  a  set  of  failure  times,  arranged  in  ascending  order,  tn+\  :=  00. 
Associated  with  each  is  a  set  of  subjects  V,;  with  covariates  27,,  j  E  T>i  who  fail  at  ti ,  and 
a  set  of  subjects  Ct  with  covariates  ztj,  j  E  Ci  who  are  censored  at  time  t  E  [ti,U+ 1).  The 
observed  event  for  the  subject  ij  is  a  triple  (ttl  Zij,  c^),  where  c  is  a  censoring  indicator,  c  =  1 
if  failure,  c  =  0  if  right  censored.  Let  H  be  the  baseline  cumulative  hazard,  with  H( 0)  =  0. 
We  assume  than  H(t)  is  a  step  function  with  jumps  at  the  failure  times  U,  i  =  1, . . . ,  n.  As 
a  step-function,  H  can  be  characterized  by  the  vector  h  =  (hi,  ...,hn),  where  hi  =  A  Hi  is 
the  jump  of  H  at  fj.  With  this  notation,  under  an  NT  model  (2),  (3)  and  non-informative 
censoring,  the  likelihood  of  survival  data  takes  the  form 

n  n 

z  =  A  log  (hi)  +  loS'W:  I  /3,  Zij,  Cij ), 

i=  1  i= 1  j€CiUT>i 


where 


i3(x  |  (3 ,  z,  c )  =  x‘ 


,  dc'y(x  |  f3,  z ) 
dxc 


d0,y/dx0  =  7,  Dj  is  the  number  of  failures  associated  with  ti  and 


Fi  =  F(U)  =  exp(—  ^  hi). 

1=1 
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The  profile  likelihood  is  defined  as  a  supremum  of  the  full  likelihood  £  taken  over  the 
nonparametric  part  of  the  model 


ipr(/3)  =  ma x£(/3,  h). 

h 


The  MLE  of  h  for  a  given  (3  will  be  denoted  h(/3)  =  (hi, ... ,  hn),  with  hk  =  hk(P),  then 

epr(P)  =  £(p,h(P))- 

Differentiating  £  with  respect  to  h  and  setting  the  score  equal  to  0  we  obtain  h(/3)  as  the 
solution  of  the  functional  self-consistency  equation 


Dr 


'lP/(i,j)ETlm  I  Pi  Zij i  Cij ) 


m  —  1, . . . , n, 


(5) 


where  Ft  is  a  function  of  hi, ...  ,hi, 


Q(x\(3,z,c) 


d\ogrd(x\f3,z,c)  ^C+I\x\(3,z,c) 

dx  c  +  x  ^  ^  |  ^  ^ 


(6) 


and  7 Zm  is  the  set  of  subjects  at  risk  just  prior  to  tm,  7Zm  =  {(*,  j)  ■  i  >m,  j  G  C*  U  V, } . 


2.2  Point  estimation 


Point  estimation  proceeds  along  the  lines  of  the  following  nested  procedure, 

•  maximize  £pr(P)  by  a  conventional  nonlinear  programming  method,  for  example,  the  Powell 
method  [Press  et  ah,  1994], 

•  for  each  f3  demanded  in  the  above  maximization  procedure,  find  ma x£(/3,h)  as  the  fixed 

h 

point  of  (5). 

The  Quasi-EM  algorithm  makes  use  of  the  straightforward  recursion  to  obtain  the  profile 
likelihood, 


Dr 


©(exp(—  £!=1  h 


Pi  %ij  i  Cij  , 


k  —  1,2,...’  m  —  1, . . . ,  n, 


(7) 


where  k  counts  iterations.  Note  that  an  increment  of  k  occurs  only  once  all  the  parameters 
hi,  i  —  1, ,  n  have  been  updated. 

It  can  be  shown  that  if  0  is  nondecreasing  in  x,  each  update  of  H  using  (7)  strictly 
improves  the  likelihood,  given  /3.  This  guarantees  convergence  of  the  sequence  of  likelihood 
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values  £(P,h^)  to  £(/3,h(/3))  under  fairly  general  conditions  [Tsodikov,  2003];  that  is,  h(/3) 
is  the  fixed  point  of  the  recursion  given  in  (7). 

It  should  be  noted  that  the  proposed  information  matrix  algorithm  is  not  contingent  on 
using  a  specific  method  for  point  estimation.  Yet  it  builds  on  the  idea  of  the  self-consistency 
through  implicit  differentiation  of  the  self-consistency  equation. 


3  Profile  Information  Matrix 


The  profile  information  matrix  is  the  observed  information  matrix  derived  from  the  profile 
likelihood, 


4r(/3) 


d2£pr((3) 
d(3d(3T  ' 


Implicit  differentiation  of  the  profile  likelihood  yields  the  following  expression  for  the 
profile  information  matrix 


Ipr(P )  —  Ip/3  +  hTpIhhhi3  +  hTpIhp  +  I^hp  +  ^  £hm 


(8) 


m=  1 


where  h  =  h(/3 )  is  some  function  of  f3,  and 

d2£  „  de(P,  h ) 


h  _  9h(f3) 

hp~  d/3  ’  U~ 


dadbr 


i  5  ^rim 


hm 


d  hr 


,  and  hm.j/3 


d2hm{(3 ) 
d(3d(3T  ’ 


with  a  and  b  equal  to  /3  or  h. 

When  evaluated  at  the  MLE  h(/3),  where  h  is  a  function  defined  implicitly  as  the  solution 
of  the  score  equation 

4(/3,/i)  =  0  =>  h  =  h(P),  (9) 


the  information  matrix  simplifies  to 

Ipr  =  1/3/3  +  Ihphp- 

Indeed,  by  virtue  of  the  score  equation  (9), 

4(/3,h(/3))  =  0. 

Differentiating  (11)  with  respect  to  (3,  we  also  have 

deh{p,h{P)) 


d/3 


~  Ifi/3  +  ^hh^0  ~ 


(10) 


(11) 


(12) 
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with  (10)  now  following  from  (8)  on  substitution  of  (11)  and  (12). 

It  should  be  noted,  however,  that  unless  the  score  equation  (9)  is  solved  for  h  exactly , 
the  short  form  of  the  observed  profile  information  matrix  (10)  is  generally  not  going  to  be 
symmetric.  Except  in  the  Cox  model,  there  is  no  closed  form  solution  to  h,  and  this  function 
is  an  output  of  a  numerical  algorithm  such  as  (7)  converging  to  h  with  some  tolerance.  To 
preserve  the  symmetry  of  Ipr)  we  prefer  to  keep  some  of  the  theoretically  redundant  terms 
in  (8)  and  use  the  form 

UP)  =  bi>  +  hlhhk J  +  hike  +  (13) 


Notice  that  Ipr  has  dimension  dx  d,  d  =  dim(/3).  Therefore  only  a  small  matrix  needs  to 
be  inverted  in  order  to  get  an  estimator  of  the  covariance  matrix  of  regression  coefficients. 

The  difficulty  in  (13)  is  that  since  h(/3)  is  defined  implicitly,  so  is  the  potentially  large 
Jacobian  matrix  dh/d/3.  Therefore,  the  Jacobian  is  generally  unavailable  in  a  closed  form. 
The  success  in  the  calculation  of  the  profile  information  matrix  is  determined  by  the  existence 
of  an  efficient  numerical  method  to  compute  dh/df3.  Generally,  computation  of  dh/d/3  is  as 
difficult  as  taking  the  inverse  of  the  original  full  model  information  matrix  (0(n3)  operations 
required),  and  this  derivation  defeats  the  purpose.  However,  if  the  functional  i3(H,t  |-)  that 
defines  model  contributions  to  the  likelihood  depends  on  (H,  t)  only  through  H(t),  which  is 
the  case  for  the  NT  models  (2),  dh/d/3  can  be  obtained  by  solving  a  system  of  linear  equations 
with  a  special  structure.  This  specific  structure  of  the  linear  system  can  be  exploited  to  derive 
an  efficient  numerical  solution  given  in  Proposition  1. 

We  first  show  how  to  obtain  Igg,  Ikg  and  Ihh- 
The  H— score  of  an  NT  model  is, 


dt 

dhk 


®(Fi\P,Zij,Cij). 

k  0 ij)enm 


Differentiating  the  H- score  with  respect  to  f3  we  get, 


de 

dhkd[3m 


E 

(i,j)eKk 


de 

<9  An 


{Fi  I  A  ziji 


Evaluation  of  derivatives  of  0  or  7  with  respect  to  [3  depends  on  the  parameterization 
of  the  model’s  predictor  as  a  function  of  explanatory  variables  z,  which  is  model-specific. 
Once  a  model  is  specified,  the  calculation  of  Igg  and  Ikg  is  straightforward. 

Since  Ft  =  exp(—  Yu=i  we  have 

dQ(Fj  1  •)  =  (  Q(Fi  |  •),  m  <  i, 
dhm  \  0,  m  >  i, 
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where 


Q(x  |  •,  c)  =  —  =  ~(0(x  |  •,  c)  -  c)(0(x  I  c  +  1)  —  0(x  |  ■,  c)),  (15) 

and  stands  for  ',l/3,  z”\  Note  that  <9@(F,:  j  -)/d hm  is  a  constant  in  m  for  m  <  i  or  m  >  i. 
From  (14)  it  follows  that, 


dH 

dhkdhm 


£ 


(*J)^^max{/c,m} 


Q\Ri  |  /3)  %iji  Cij )  d"  ,2  l{fe=m}) 


where 

_  f  1,  k  =  m, 

{k=m}  \  0,  k  ±  m. 

From  this  we  get  Ihh- 

Now  we  turn  our  attention  to  the  Jacobian  d h/d/3.  Proposition  1  gives  the  main  result 
used  to  efficiently  calculate  d h/d/3  in  the  case  of  NT  models.  Its  proof  is  given  in  the 
Appendix. 


Proposition  1  Let  D  be  an  n  x  n  diagonal  matrix  with  diagonal  elements  d,  0,  i  = 
1, . . . ,  n.  Let  R  =  (Rki)  be  an  n  x  n  matrix,  with  Rki  =  J2i=nmx{k  i}  ao  where  i  —  1, . . .  n 
are  real  numbers.  Let  b  be  an  n-dimensional  vector. 

Define  the  functions  (pk  '■  K-  — >  M,  k  —  1, . . . ,  n  recursively  as 


V>n{y) 

<Pk(y) 

for  y  in  M.  Let  (p 


dn 

1 

dk 


CLn 

-  -ry, 

dn 

n  n  l—l 

bk-J2a^y+  am(y) 

,  i=k  l=k+ 1  i=k 

be  the  function  given  by  <p(y)  = 


=  n  -  1, . . . ,  1, 
(Pk(y)  and  let 


m 

1  +  ^(0)  -£(i)' 


Then  the  solution  to  the  system  of  equations  (D  +  R)x  =  b  is  the  n-dimensional  vector 

%=  (<pl  (y),---,<Pn(y))T- 
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We  now  show  that  the  Jacobian  <9 h/d/3  satisfies  a  relationship  of  the  form  as  discussed 
in  Proposition  1.  Differentiating  the  self-consistency  equation  (5)  implicitly,  we  get  that  h 
satisfies  the  relationship 


<9  hr 


h 


df3k  Dr 


^  Q(Fi  \P,Zij, 


l— 1  (*  J)e7 


.  <9 hi  \ — '  <90  .  , 

Cii)aA  +  ..2J 

(l,j)eTZm 


'13)  1 


where  Q  is  the  function  given  in  (15). 

Let  D  be  the  diagonal  matrix  with  elements 


din. 


Dr 


(hr 

Let  R  =  (. Rml )  with  Rml  =  where 


m  —  1, . . . ,  d. 


ai=  I  P’  %’  c*j)>  *  =  1,  •  ■  ■ ,  n 

ieCiUDi 


and  for  k  —  1, . . . ,  d  let 


(16) 


(,'*>  =  -  £ 

\  (*j')e7^i 

It  follows  from  (16)  that 


<90 (Fi  |  (3,  zi:j,  dj ) 


<9  Ac 


ij  l  cij )  _ 


<90  (A  |  A  Zij,  c 


IJJ 


<9  Ac 


dh 

dfdk 


=  - D ~l  |  A 


dh 
<9  Ac 


6(fc)  . 


Hence, 


Therefore,  for  each  k  =  1, . . . ,n  the  vector  dh/ (3^  can  be  obtained  from  Proposition  1.  We 
now  have  all  the  components  of  (13)  defined.  This  completes  the  exposition  of  our  method. 


4  Examples 

In  the  examples  we  compare  the  performance  of  four  methods  to  compute  the  observed 
profile  information  matrix.  A  brief  explanation  of  the  methods  and  details  on  how  they  were 
implemented  in  our  examples  are  given  below. 
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1.  Discretized.  The  estimation  is  based  on  the  result  of  Corollary  3  in  Murphy  and  van  der 
Vaart  [2000].  Under  certain  conditions 


-2 


log  £pr0  +  hnVn)  ~  log  lpr{(3)  P > 
nh'i 


V  Ip,-  V , 


(17) 


P  J  P 

for  all  sequences  vn  — >»£  W1  and  hn  — *  0  such  that  ( \/nhn)~l  =  Op(  1). 

In  order  to  estimate  all  the  elements  of  Ipr,  we  chose  v  —  ei,  i  —  1, ...  ,d  and  v  —  et  +  ej7 
1  <  i  <  j  <  d,  where  e*  are  Euclidean  basis  vectors. 

We  set  vn  =  v,  and  hn  =  10 /(y/nCk)  with  C  =  1.4  and  k  such  that  \f  (It)  /  (y/nCk))  — 
/(10/(v^nCfc_1))j  <  0.001,  where  f(h)  is  the  left  hand  side  of  equation  (17)  considered  as 
a  function  of  h.  This  procedure  was  motivated  by  Dixon  et  al.  [2005]  who  considered  a 
choice  of  hn  in  the  one  dimensional  (d  —  1)  situation. 

2.  Quadratic.  This  approach  approximates  the  profile  likelihood  surface  by  a  quadratic  form 
and  derives  the  estimate  of  the  information  matrix  from  the  coefficients  of  the  form  fitted 
to  the  surface.  Specifically,  let  A f3  be  a  vector  of  deviations  of  the  f3  values  sampled  in 
the  vicinity  of  /3,  and  let  Alpr  be  the  induced  vector  of  deviations  of  the  profile  likelihood 
from  its  maximum  value,  £pr(/3).  Then,  if  A (3  is  sufficiently  small 

A£pr  «  ^  A(3TIprA(3 . 

Fitting  the  quadratic  form  ( l/2)A(3TAA/3  to  points  (A/3,  A£w)  by  least  squares  produces 
an  estimate,  A,  of  the  profile  information  matrix  Ipr. 

In  our  implementation  of  this  method  we  limit  the  domain  to  points  that  are  not  rejected 
at  0.05  significance  level  by  the  LR  test  (applied  informally  and  disregarding  the  multi¬ 
comparison  issue).  In  other  words,  points  (3  are  included  if  —2  |/'pr(/3)  —  ^pr(/9)|  A  Cb,o.o5; 

where  U^o.os  is  the  0.05th  upper  tail  percentile  of  the  y2  distribution  with  d  =  dim(/3) 
degrees  of  freedom.  Since  the  validity  of  the  quadratic  approximation  is  itself  a  prerequisite 
for  the  validity  of  the  likelihood  ratio  statistic,  this  choice  is  far  from  perfect.  Yet  this 
procedure  would  ensure  a  desired  property  of  the  domain  shrinking  with  sample  size,  and 
we  know  of  no  better  alternative. 


3.  Numerical.  The  calculation  of  the  observed  profile  information  matrix  is  carried  on  using 
Ridder’s  numerical  differentiation  of  the  profile  likelihood  function,  see  Press  et  al.  [1994], 
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Let  /  :  M  ->  1  be  a  differentiable  function.  By  definition,  the  derivative  of  /  is  the  limit 
as  h  — >  0  of  the  incremental  quotient 

g(h)  =  /(rC  +  ^/W. 

The  basic  idea  of  Ridder’s  method  is  to  calculate  q(h)  for  several  values  of  h,  and  then 
extrapolate  the  result  to  the  limit  h  =  0.  In  the  case  of  a  function  with  domain  on  Md, 
the  vector  of  first  derivatives  is  obtained  applying  the  algorithm  on  each  coordinate  at  a 
time  and  leaving  the  other  coordinates  fixed. 

Numerical  experimentation  showed  that  this  approach  gives  the  second  derivatives  of 
£pr(/3 )  with  very  high  precision,  albeit  at  a  greater  computational  cost  than  other  methods. 

4.  Exact.  This  is  the  method  developed  in  Section  3  of  this  paper  for  computation  of  the 
exact  observed  profile  information  matrix  for  NTM. 

PO  model  will  be  used  as  a  basis  for  all  our  comparisons.  The  validity  of  NPMLE  and  the 
profile  likelihood  for  this  model  has  been  demonstrated  elsewhere. 


4.1  The  Proportional  Odds  Model 

Given  covariates  z,  the  survival  function  G(t  |  /3,  z)  of  a  PO  model  can  be  written  in  the 
form, 

c(mz)  =  G(t\m*))  =  e£^m,  us) 

where  H  is  some  nonparametrically  specified  baseline  cumulative  hazard  function,  and  6  is 
a  predictor.  Since  H  =  —  log-F,  the  NTM-generating  function  of  the  PO  model  is 

ry(x  I .)  = _ _ 

1  ’  0(0  -  logs' 


A  characteristic  feature  of  the  PO  model  is  that  for  any  two  values,  6 1,  02,  of  the  predictor, 
the  odds  ratio 


Odds (G(t  |  0i))  _  0i 
Odds(G(i  |  02))  “  e~2 


is  constant  in  t. 

It  follows  that 


@(s  |  •,  c) 


c  +  1 

0(- )  —  logx’ 
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We  consider  an  exponential  parameterization  of  the  predictor  8(/3,  z)  =  exp  (f3Tz).  With  this 
parameterization, 


80  a 


d2e 


=  ezzJ 


d(3d(3'1 

The  following  derivatives  of  0  are  necessary  to  specify  the  algorithm  of  Section  3, 


<90  <90, 

~d(3  ~  ~d9dZl 


o2e 

d/38/3'1 


902  °  +  96  ° 


ZZ 


where 


<90  (x  j  •,  c) 

Fe 


c+  1 

(0(0  -  logx)2’ 


and 


<92@(x  |  •,  c) 

W2 


2(c  +  1) 
(#(•)  -  logx)3 


4.2  Real  Data 

As  an  example,  we  use  data  from  the  National  Cancer  Institutes  Surveillance  Epidemiology 
and  End  Results  (SEER)  program.  Using  the  publicly  available  SEER  database,  11621 
cases  of  primary  prostate  cancer  diagnosed  in  the  state  of  Utah  between  1988  and  1999 
were  identified.  The  following  selection  criteria  were  applied  to  a  total  of  19819  Utah- 
cases  registered  in  the  database:  valid  positive  survival  time,  valid  stage  of  the  disease, 
age  >  18  years.  Prostate  cancer  specific  survival  was  analyzed  by  stage  of  the  disease 
(localized/regional  vs.  distant).  For  the  definition  of  stages  as  well  as  for  other  details  of 
the  data  we  refer  the  reader  to  SEER  documentation  http://seer.cancer.gov/. 

The  data  analysis  presented  in  this  paper  is  a  continuation  of  the  one  given  in  [Tsodikov, 
2003].  Two  groups  of  patients  representing  stage  at  diagnosis  of  the  disease  are  considered, 
hence  the  predictor  in  the  PO  model  has  a  single  parameter  (3.  The  log  odds  ratio  (3  measures 
the  disadvantage  of  being  in  the  distant  stage  relative  to  local/regional  stage.  The  QEM 
algorithm  was  applied  to  fit  the  PO  model  to  the  data.  The  maximum  likelihood  estimate 
of  (3  is  (3  —  —3.251.  Confidence  intervals  for  (3  were  obtained  using  the  Wald  statistic 
based  on  the  profile  information  matrix.  The  confidence  interval  based  on  the  quadratic 
approximation  of  the  profile  information  matrix  is  (—3.416,  —3.086)  and  the  one  obtained 
through  the  exact  profile  information  matrix  is  (—3.415,  —3.086).  Excellent  concordance  of 
the  two  confidence  intervals  is  due  to  the  large  sample  size  and  the  small  dimension  of  the 
regression  parameter,  a  situation  when  approximating  methods  tend  to  be  accurate. 

In  the  case  of  a  single  parameter,  the  observed  profile  information  matrix  is  a  scalar.  The 
estimates  of  the  observed  profile  information  matrix  are  142.1011,  141.2158  and  141.7424  for 
the  Discretized,  Quadratic  and  Numerical  approaches  respectively  and  the  Exact  value  is 
141.7423.  Although  the  values  are  quite  similar  it  is  clear  that  the  discretized  and  quadratic 
approaches  depart  from  the  true  value. 
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4.3  Simulations 

4.3.1  Simulations  setup 

We  simulated  age  at  diagnosis  for  an  adult-onset  disease  using  a  proportional  odds  model. 
The  example  in  its  baseline  survival  is  loosely  based  on  indicence  of  prostate  cancer.  The 
baseline  survival  function  was  assumed  to  follow  a  Weibull  distribution  with  the  median  of 
38  years  and  shape  of  1.8,  the  risk  starting  at  the  age  of  18  (a  fixed  number  of  18  years 
was  added  to  survival  time  and  censoring).  With  these  parameters  incidence  before  the 
age  of  40  is  negligible.  Independent  censoring  mechanism  was  assumed.  Censoring  times 
were  generated  using  Weibull  distribution  with  a  median  of  46  years  and  shape  parameter 
of  4.  Observations  in  excess  of  105  years  were  type-I  censored  at  105.  Two  covariates  were 
introcuded,  one  categorical  with  3  levels,  and  one  continuous  (a  risk  factor)  with  a  range 
between  -1  and  1.  Values  for  both  covariates  were  generated  independently.  The  continuous 
covariate  followed  a  uniform  distribution.  The  discrete  distribution  for  categorical  covariate 
assumed  the  following  probabilities,  0.7  (level  1),  0.5  (level  2),  and  0.1  (level  3).  The  following 
covariate  effects  were  assumed.  An  effect  of  the  log  odds  ratio  of  2  was  assumed  for  a  unit 
change  in  the  continuous  factor.  Categorical  covariate  was  assumed  to  have  a  progressing 
effect  on  the  risk  of  the  disease.  The  log  odds  ratios  comparing  level  2  and  level  3  to  the 
baseline  level  1  were  1.5  and  2.5,  respectively. 

4.3.2  Speed 

To  assess  the  speed  of  performance  of  the  four  methods  we  calculated  the  number  of  opera¬ 
tions  required  to  compute  the  exact  information  matrix  and  its  approximations.  Evaluation 
of  0,  7,  their  analytically  specified  derivatives  or  similar  comparable  procedures  were  counted 
as  one  operation.  Figure  1  shows  the  number  of  operations  by  sample  size  and  method.  In 
order  to  make  the  performance  results  comparable,  the  precision  of  estimation  algorithms 
was  calibrated  on  an  ad-hoc  basis  so  that  the  relative  error  of  the  three  methods  (Discretized, 
Quadratic,  Numerical)  was  approximately  the  same  (0.02).  For  any  method  A,  the  relative 
error  was  defined  as  \\Ipr(A)  —  Jpr(Exact)||  /  ||/pr(Exact)||,  where  Ipr(A )  is  an  estimate  of  the 
observed  profile  information  matrix  computed  using  method  A,  and  the  norm  is  defined  as 
the  sum  of  absolute  values  of  all  elements  of  the  matrix.  Regardless  of  the  sample  size,  the 
exact  calculation  outperformed  the  approximate  methods.  Inference  based  on  the  discretized 
second  derivative  requires  between  10  and  30  times  as  many  operations  as  the  exact  calcula¬ 
tion.  The  quadratic  approach  requires  between  60  and  200  times  as  many  operations  as  the 
calculation  of  the  exact  Ipr  matrix.  The  numerical  method  is  computationally  very  costly 
requiring  between  600  and  7000  as  many  operations  as  the  exact  approach.  However,  the 
numerical  approach  behaves  better  than  the  other  two  methods  in  terms  of  relative  error  as 
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Operations 


Performance 


Sample  Size 


Figure  1:  Operations  by  sample  size  characteristics  of  four  methods  of  computation  of  the 
observed  profile  information  matrix.  The  Exact  method  developed  in  this  paper  shows  the 
highest  numerical  efficiency. 
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shown  in  Section  4.3.3. 


4.3.3  Precision 

A  sample  of  size  500  was  used  to  find  the  smallest  possible  relative  error  of  the  method 
when  adjusting  the  different  parameters  involved  on  an  ad-hoc  basis.  The  best  relative  error 
achieved  by  the  Discretized  method  was  0.01  and  8.13  105  operations  were  required.  This 
number  was  0.013  for  the  quadratic  approach  with  5.32  106  operations  required,  while  the 
numerical  approach  achieved  a  relative  error  of  8  10~'  and  required  3.87  108  operations.  This 
example  shows  that  the  numerical  approach  is  perhaps  the  only  one  of  the  approximating 
methods  that  can  compete  with  the  exact  procedure  in  terms  of  precision  required  in  real-life 
analysis.  It’s  high  computational  cost  though  makes  it  a  poor  choice  for  variable  selection 
and  other  procedures  requiring  repeated  evaluations  of  Ipr . 

4.3.4  Statistical  properties 

Three  sets  of  experiments  were  performed  with  samples  of  size  100,  500,  and  1000.  For  each 
sample  size,  1000  simulated  samples  were  generated.  The  covariance  matrices  based  on  Ipr 
were  computed  for  each  sample  using  the  four  approaches  discussed.  The  mean  and  standard 
deviation  of  each  of  the  entries  of  the  estimators  of  covariance  matrices  under  study  were 
estimated  from  the  1000  replicates.  In  addition,  point  semiparametric  MLE  estimates  of  the 
three  parameters  entering  profile  likelihood  (log  odds  ratios  for  the  continuous  factor  and  level 
2  vs.  1  and  3  vs.  1  contrasts)  were  used  to  compute  the  empirical  covariance  matrix  based 
on  1000  replicates.  A  comparison  of  the  estimated  means  of  the  entries  of  the  covariance 
matrices  calculated  using  exact  and  numerical  approaches  with  the  empirical  ones  were  used 
to  evaluate  how  well  these  methods  estimate  the  true  finite  sample  variance-covariance.  The 
results  are  shown  in  Table  1.  Two  factors  are  contributing  to  the  distance  between  the 
exact  and  numerical  approaches  and  the  empirical  one:  the  finite-sample  bias  of  covariance 
estimates  based  on  Ipr,  and  the  bias  in  the  estimate  of  Ipr  by  an  approximating  method 
(this  latter  bias  does  not  pertain  to  the  exact  method).  The  following  basic  conclusions  are 
evident  from  the  Table  1. 

1.  All  methods  are  much  better  at  estimating  variances  (left  half  of  the  table)  than  covariances 
(right  half  of  the  table); 

2.  The  precision  of  estimation  of  covariance  improves  rapidly  with  sample  size; 

3.  Under  all  sample  sizes  the  numerical  approach  showed  excellent  concordance  with  the 
exact  method.  This  is  in  agreement  with  our  earlier  observation  that  of  all  approximat- 
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ing  methods,  the  numerical  approach  is  most  precise  albeit  computationally  costly.  The 
quadratic  method  showed  itself  as  least  precise  in  this  analysis; 

4.  For  reasonably  large  sample  sizes  all  methods  are  in  good  agreement  with  the  empirical 
estimate. 


5  Discussion 

In  this  paper  we  have  proposed  a  method  to  compute  profile  information  matrix  based 
on  implicit  differentiation  of  the  self-consistency  equation.  Computationally  the  method 
outperformed  all  existing  approaches  to  the  best  of  our  knowledge.  An  attractive  property 
of  the  procedure  is  that  it  is  exact  contingent  upon  point  estimates.  Even  though  exact 
point  estimates  are  hardly  ever  available,  the  precision  of  variance- covariance  estimation  is 
improved  as  the  method  does  not  add  any  error  to  the  one  associated  with  impresision  of 
point  estimates.  Numerically  efficient  and  stable  procedures  for  point  estimates  have  been 
developped  earlier  and  provide  a  good  complement  to  this  methodology.  We  recommend  the 
Exact  method  as  a  preferred  choice  with  Nonlinear  Transformation  Models. 

Since  derivatives  of  the  profile  likelihood  are  defined  implicitly,  applying  Newton- Raphson 
method  to  the  profile  likelihood  for  point  estimation  is  a  challenge.  The  Newton  Raphson 
typically  requires  exact  inverse  Hessian  matrix  and  is  not  guaranteed  to  converge  if  this 
matrix  is  approximated.  The  results  of  this  paper  can  be  used  to  provide  an  exact  inverse 
Hessian  matrix,  at  any  point  in  the  parameter  space  and  thus  enable  the  Newton  Raphson 
method  for  use  with  the  profile  likelihood. 
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replicates  for  the  four  methods  studies 
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6  Appendix 

Proof  of  Proposition  1 

The  equation  (D  +  R)x  =  b  implies 

n 

dk'Xk  +  ^  Rkixi  =  bk,  k  =  1, . . . ,  n.  (19) 

i=i 

Since  Rki  =  Y^=max{k,i}  a follows  from  (19)  that  for  k  —  1, . . . ,  n , 

n  k  n  n 

bk  —  dkXk  +  Xi  +  EE  diXi 

i=k  1=1  l=k-\- 1  i=l 

n  k  n  i 

=  te  +  I>5>l+  £  £  aiXl 

i=k  1=1  i=k-\- 1  l=k+ 1 

n  /  n  n 

i=k  \  1=1  l=i-\- 1 

The  second  equality  above  is  a  consequence  of  a  change  of  summation  order. 
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Hence,  solving  the  system  of  equations  (D  +  R)x  =  b  is  equivalent  to  solving  the  system 


^  /  n  n  l—l  \ 

xk  =  J-  [bk  -  OiV  +  yt  y ^  OjXi  ,  A;  =  1, . . . ,  n 

\  i=k  l=k+ 1  i=k  / 

n 

y  =  ^2xi- 

i=i 

Notice  that  {a;*,}  are  in  fact  functions  of  y  obtained  recursively,  xk  =  <pk(y), 
Y^=  i  </?&(?/)  =  ^(?/)-  The  solution  of  this  system  of  equations  is  the  vector  (<£>i(y), . . 
where  y  satishes  <^(y)  =  y.  Since  <fk,  k  —  1, . . .  ,n  are  linear  functions  of  y,  so  is 
<p(y)  —  y  =  ay  +  b  with 


Therefore, 


End  of  proof. 


a  =  (p{l)  —  1  —  </?(0)  and  b  =  </?(0). 
y=  _ M _ , 

1  +  ^(0)  -£(i) 


and  y  = 

■iVn(y))T 
(p.  Hence, 
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SUMMARY 

Introduction  of  screening  for  prostate  cancer  using  the  prostate-specific  antigen  (PSA)  marker  of  the 
disease  led  to  remarkable  dynamics  of  the  incidence  of  the  disease  observed  in  the  last  two  decades. 
A  statistical  model  is  used  to  provide  a  link  between  dissemination  of  PSA  and  the  observed  tran¬ 
sient  population  responses.  The  model  is  used  to  estimate  lead  time,  overdiagnosis  and  other  relevant 
characteristics  of  prostate  cancer  screening.  Copyright  ©  2006  John  Wiley  &  Sons,  Ltd. 

KEY  WORDS:  screening;  mixture  models;  prostate  cancer 


1.  INTRODUCTION 

Continuing  controversy  and  discussions  surround  the  issue  of  whether  prostate-specific  antigen 
(PSA)  screening  of  asymptomatic  men  can  be  linked  to  recent  decline  in  prostate  cancer 
mortality  [1-4].  Shown  in  Figure  1  is  the  age-adjusted  incidence  and  cause-specific  mortality 
curve  as  estimated  from  the  Surveillance,  Epidemiology  and  End  Results  (SEER)  [5]  database 
developed  and  maintained  by  the  National  Cancer  Institute.  While  the  incidence  curve  shows 
a  sharp  peak  with  the  introduction  of  PSA  testing  in  the  late  1980s,  mortality  showed  a  much 
less  dramatic  behaviour.  Advocates  of  screening  argue  that  screening  induces  a  favourable  shift 
in  the  distribution  of  stage  of  the  disease  at  diagnosis  and  that  earlier  detection  and  treatment 
should  lead  to  better  prognosis  and  reduce  mortality  from  the  disease.  The  difficulty  in  proving 
the  point  is  rooted  in  the  complexity  of  the  changes  that  screening  for  a  disease  with  high 
latent  prevalence  brings  to  the  observed  population  statistics.  Lead  time  and  overdiagnosis  are 


*  Correspondence  to:  Alex  Tsodikov,  Department  of  Public  Health  Sciences,  Division  of  Biostatistics,  University  of 
California  Davis,  One  Shields  Avenue,  Davis,  CA  95616,  U.S.A. 
tE-mail:  atsodikov@ucdavis.edu 
^E-mail:  aszabo@hci.utah.edu 

Contract/grant  sponsor:  National  Cancer  Institute  Cancer  Intervention  and  Surveillance  Modelling  Network; 
contract/grant  number:  U01  CA97414 

Contract/grant  sponsor:  Department  of  Defense;  contract/grant  number:  DAMD  7-03-1-0034 


Copyright  ©  2006  John  Wiley  &  Sons,  Ltd. 


Received  20  October  2004 
Accepted  1  April  2005 


A.  TSODIKOV,  A.  SZABO  AND  J.  WEGELIN 


Age-adjusted  Incidence  and  Mortality 


Figure  1.  Prostate  cancer  incidence  and  mortality  rates  by  year  of  diagnosis  age 
adjusted  to  U.S.  population  in  year  2000. 


among  the  key  characteristics  of  the  impact  of  screening  on  the  natural  history  of  the  disease 
that  can  be  identified  through  latent  variable  modelling  of  prostate  cancer  incidence. 

Lead  time  measures  an  advance  in  the  diagnosis  of  prostate  cancer  due  to  screening.  It 
adds  to  the  observed  survival  time  even  if  early  detection  and  treatment  were  of  no  benefit. 

A  large  proportion  of  prostate  cancers  identified  through  screening  would  never  be  detected 
in  the  absence  of  screening.  This  phenomenon  is  called  overdiagnosis.  Screening  brings  such 
cancers  to  the  surface  predominantly  in  the  localized  stage  of  the  disease  leading  to  an  apparent 
‘favourable’  stage-shift.  Overdiagnosis  has  multiple  consequences.  It  leads  to  over-treatment 
of  men  who  would  never  be  detected  without  screening.  Also,  it  modifies  apparent  estimates 
of  post-treatment  survival  as  over-diagnosed  cases  appear  to  be  ‘cured’.  Injection  of  over¬ 
diagnosed  cases  into  the  pool  of  all  prostate  cancer  presentations  at  diagnosis  changes  the 
distribution  and  the  meaning  of  clinical  covariates  in  men  diagnosed  with  prostate  cancer  in 
the  PSA  era.  With  the  introduction  of  screening,  the  prognostic  value  of  such  covariates  at 
diagnosis  is  modified.  For  these  reasons  the  prognosis  for  cases  diagnosed  in  the  screening 
era  is  markedly  different  than  for  cases  detected  naturally.  Inclusion  of  clinical  covariates 
into  the  model  may  be  misleading  and  requires  special  care  to  adjust  the  covariate  effects  for 
screening  patterns  in  the  population. 

Modelling  represents  an  important  tool  for  studying  screening  phenomena.  Mathematical 
and  simulation  models  have  been  used  for  inference  in  cancer  screening  trials  to  evaluate 
controlled  randomized  screening  interventions  [6, 7].  While  perhaps  providing  the  best  design 
for  evaluating  the  impact  of  screening,  such  randomized  trials  often  fail  to  respond  to  impor¬ 
tant  challenges.  Randomized  screening  trials  typically  require  decades  of  observation  in  order 
to  register  a  significant  effect  of  screening  on  cancer  mortality.  During  this  period  screening 
modality  is  often  outmoded  by  new  diagnostic  advances.  Considerable  changes  in  the  practice 
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of  management  of  the  disease  and  therapy,  often  concurrent  with  advances  in  screening  tech¬ 
niques,  make  it  difficult  for  screening  trials  to  catch  up  with  an  ever  evolving  scientific  and 
technological  progress  of  cancer  detection  and  treatment. 

A  number  of  simulation  and  analytic  models  have  been  designed  to  translate  the  results  of 
screening  trials  into  the  population  setting  [8—10,7].  Given  the  complexity  of  the  problem  at 
hand  no  single  approach  can  provide  the  final  answer.  In  particular,  relatively  small  populations 
used  in  screening  trials  run  the  risk  of  not  being  representative  for  the  national  population. 
Screening  patterns  operating  in  the  national  population  are  quite  different  from  the  artificial 
homogenized  schedules  pursued  in  screening  trials.  In  order  to  capture  the  complexity  of 
screening  in  a  population,  we  adopt  an  approach  of  estimation  and  prediction  of  the  effect  of 
cancer  screening  directly  from  population  data.  We  consider  screening  schedules  as  a  random 
point  process  in  the  population  and  use  characteristics  of  PSA  dissemination  to  inform  the 
model  about  the  properties  of  this  process.  Point  estimates  and  confidence  intervals  for  the 
model  parameters  are  based  on  the  maximum  likelihood  technique.  Population  databases  and 
cancer  registries  such  as  SEER  provide  a  unique  resource  for  studying  the  ‘ in  vivo ’  dynamics 
of  the  population  impact  of  screening. 


2.  CANCER  INCIDENCE  MODEL 


2.1.  The  basic  model 

We  use  the  classical  three-stage  model  of  the  natural  history  of  a  chronic  disease  [11].  Prostate 
cancer  is  a  result  of  an  irreversible  transition  of  the  disease  through  three  consecutive  stages: 
disease-free  stage,  pre -clinical  stage  and  clinical  stage.  The  time  spent  in  disease-free  stage 
is  characterized  by  the  age  Y  (a  random  variable)  at  onset  of  the  disease.  In  the  pre-clinical 
stage  disease  is  asymptomatic  and  can  be  detected  by  a  screening  test.  The  duration  of  the 
preclinical  stage  in  the  absence  of  screening  (a  random  variable)  is  termed  the  sojourn  time. 
If  undetected  by  screening,  the  disease  can  either  reach  the  clinical  stage  or,  alternatively,  the 
event  of  clinical  diagnosis  gets  right  censored  by  a  competing  risk  other  than  the  disease  of 
interest. 

It  is  clear  that  cancer  incidence  in  an  individual  is  a  convolution  of  two  generally  dependent 
survival  times.  In  what  follows  we  will  use  the  notation  X  for  a  hazard  function  (incidence 
rate),  /  for  a  probability  density  function  (p.d.f.),  and  G  for  a  survival  function  (s.f.)  unless 
noted  otherwise.  Let  Xo  be  the  hazard  function  of  Y  =  (age  at  disease  onset).  Denote  by  fo 
and  G0  the  corresponding  p.d.f.  and  s.f.  of  Y,  respectively. 

Prostate  cancer  incidence  X\ (a,t)  in  year  t  at  the  age  of  a  can  be  written  as  X\(a\t  —  a), 
where  X\(a\x)  is  the  hazard  function  for  cancer  diagnosis  at  the  age  of  a  for  a  person  bom 
in  year  x.  Clearly,  for  the  x -birth  cohort 


X\(a\x) 


f  iQI*) 

Gi(a\x) 


(1) 


The  functions  /)  and  G\  are  in  fact  represented  by  a  fairly  complex  mixture  model  which  we 
now  start  detailing. 
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Let  us  condition  on  the  age  at  tumour  onset  to  obtain  the  following  convolution: 

/i(40=  [  /iO  -  y\x,  y)fo(y\x)  &y  (2) 

Jo 

where  /)(c|x,  y)  is  a  conditional  p.d.f.  of  the  random  time  T  from  tumour  onset  to  its  potential 
diagnosis.  The  random  variable  T  represents  the  duration  of  the  latent  disease  stage.  Generally, 
y)  is  an  average  over  random  patterns  of  screening  operating  in  the  population.  It  is 
clear  that  T  is  a  result  of  two  dependent  competing  risks:  the  one  associated  with  natural 
clinical  diagnosis  through  symptoms  and  the  one  associated  with  detection  through  screening. 
Dependency  between  the  two  risks  is  a  consequence  of  natural  detection  and  screen-based 
detection  risks  sharing  the  same  disease  development  process  in  the  subject.  For  example,  the 
event  of  natural  detection  indicates  a  non-zero  risk  of  screen-based  detection  in  the  subject 
as  it  informs  us  that  the  onset  of  the  disease  has  already  happened.  This  dependency  will  be 
modelled  through  the  concept  of  shared  mixed  effect  (frailty)  [12]. 

In  our  first  approach  we  identify  the  onset  time  Y  with  the  shared  mixing  variable  and 
make  the  assumption  of  conditional  independence  of  potential  risks  of  natural  and  screen- 
based  detection,  given  Y.  Specifically, 

Gitf  \x,  y )  =  Gcdx^I*,  y)GSDx(£\x,  y)  (3) 

where  Gcdx  is  the  s.f.  of  time  to  clinical  diagnosis  (CDx),  the  sojourn  time  in  the  absence 
of  screening,  and  Gsdx  is  the  s.f.  of  the  potential  time  to  screen-based  diagnosis  (SDx).  Here 
£,  is  time  since  onset,  x  is  date  of  birth,  and  y  is  the  age  at  onset.  Note,  that  Gsdx  in  our 
model  corresponds  to  a  continuous  distribution  as  it  is  represented  as  a  continuous  mixture 
over  random  screening  schedules  in  the  population. 

The  density  fi{^\x,  y)  corresponding  to  cancer  diagnosis  given  birth  year  x  and  onset  time 
y  can  be  split  into  the  two  crude  densities  corresponding  to  the  two  modes  of  diagnosis: 
screening  and  clinical 

m\x,  y)  =  fSDx(Z\x,  y)Gc  dx(£|*,  y )  +  fcu^\x,  y)GSDx(£\x,  y)  (4) 

With  age  at  tumour  onset  y  integrated  out  of  (4),  we  obtain  a  similar  relationship  describing 
the  partition  of  the  observed  p.d.f.  f\{a\x)  into  the  two  crude  components  corresponding  to 
the  two  modes  of  detection 


/i(a|x)  =  /sDX(a|x)  +  /cdxOI*)  (5) 

where 


/sCDxOI*)=  [  /sdxO  -  y\x,  y)GCDx(a  -  y\x,  y)fo(y\x)  dy  (6) 

JO 

and 

/cdxOI*)=  [  fcv>x(a- y\x,y)GsTK(a- y\x,y)fo(y\x)&y  (7) 

Jo 

Note  that  the  crude  p.d.f.  /scDx(a|x)  is  a  function  of  age  of  the  subject  while  the  net  p.d.f. 
fsox(£\x,y)  conditioned  on  age  at  tumour  onset  is  a  function  of  the  age  of  tumour  j  =  a  y. 
Also,  fc  here  are  net  densities  with  respect  to  death  due  to  causes  other  than  prostate  cancer. 
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It  should  be  noted  that  the  assumption  of  conditional  risk  independence  is  not  essential. 
The  logic  of  this  paper  can  be  carried  forward  with  minor  changes  if  Gsdx  is  conditioned  on 
the  sojourn  time  or  on  a  more  complex  surrogate  of  natural  history  process.  However,  at  this 
point  we  stop  short  of  making  the  model  more  complex  and  make  this  and  other  refinements 
contingent  upon  compelling  evidence  from  the  data.  One  class  of  models  consistent  with  the 
independence  assumption  is  the  one  where  tumour  growth  is  assumed  to  be  deterministic,  and 
where  competing  risks  of  detection  are  independent,  given  the  tumour  growth  curve  [13]. 

The  distribution  for  a  non-negative  random  variable  can  be  represented  by  a  survival  func¬ 
tion  G,  a  hazard  function  A,  or  the  p.d.f.  /.  Dependent  on  the  situation,  we  will  use  the  most 
convenient  representation  and  keep  in  mind  that  other  characteristics  can  be  obtained  using 
the  well-known  relationships 

G(0  =  expj— jf  A«)d£}  =1  — jf/(OdS 

f(t\-Wl=_dGW 

71  ’  G(t )  At 


or  their  discrete  counterparts. 


2.2.  Modelling  cancer  detection  through  screening 

This  section  is  devoted  to  modelling  the  distribution  of  potential  time  to  screen-based  detection 
Gsox(£\x,y)  conditional  on  the  year  of  birth  x  and  age  at  tumour  onset  y. 

For  an  arbitrary  individual  from  the  target  population,  consider  the  ‘risk’  of  getting  the 
first  screen  in  his  life.  Age  at  first  screen  may  be  regarded  as  a  survival  time  with  the 
instantaneous  risk  represented  by  the  hazard  function  X\ s(a,  f).  Naturally,  AiS  depends  on  age 
a  of  the  person  and  the  current  calendar  year  t.  Generally,  it  is  expected  that  Ais(n,  0  increases 
in  t  starting  with  the  year  of  PSA  introduction.  As  a  function  of  a,  it  is  reasonable  to  expect 
that  Ais(a,  t)  is  increasing  initially  while  the  residual  life  expectancy  is  still  substantial  and 
then  decreasing  for  very  old  people.  An  empirical  histogram  estimate  for  Ais(a, 0  can  be 
obtained  by  dividing  the  number  of  subjects  at  the  age  of  a  receiving  their  first  screen  in 
year  t  by  the  total  number  of  subjects  with  no  evidence  of  the  disease  in  the  (a,  t )  cell.  More 
precisely,  we  should  count  tests  in  the  interval  (t,  t  +  dt)  and  divide  by  d t,  which  results  in 
the  same  estimate  for  the  grouping  interval  At  =  1  year.  Note  that  this  estimate  is  inconsistent 
unless  the  data  are  grouped  [14]. 

The  evolution  of  an  x-birth  cohort  up  to  the  age  of  a  can  be  represented  as  a  line  connecting 
points  (t,x  +  t),  where  t€  [0,a],  on  the  age  by  year  plane  called  the  Lexis  diagram  [15].  The 
probability  of  no  screens  by  the  age  of  a,  Gis,  is  a  survival  function  obtained  by  integrating 
(accumulating)  the  hazard  Ais  over  the  line 


Gis(a|x)  =  exp 


/  AiS(t,x  +  r)dr 
Jo 


(8) 


Denote  by  A2S (a,  t )  the  intensity  of  screening  in  subjects  who  already  had  their  first  screen. 
Generally,  we  expect  A2S  to  be  larger  than  A1S.  Indeed,  the  fact  that  the  subject  has  had  his 
first  PSA  test  may  identify  him  as  a  member  of  the  group  that  is  screened  more  frequently  for 
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reasons  such  as  easier  access  to  secondary  testing  having  done  this  once  already,  favourable 
attitude  towards  screening  in  those  who  choose  to  have  their  first  test,  doctor’s  recommenda¬ 
tions  for  serial  secondary  screens  following  the  first  one,  etc. 

The  model  for  risk  of  diagnosis  by  cancer  screening  is  based  on  the  following  assumptions: 

•  The  probability  that  a  subject  bom  in  year  *  who  has  never  been  screened  by  the  age 
of  a  receives  his  first  screen  in  the  age  interval  (a,  a  +  dn)  is  A\s(a,x  +  a)  da  +  o(da). 

•  The  probability  that  a  subject  bom  in  year  x  who  has  been  screened  at  least  once  by  the 
age  of  a  receives  a  screen  in  the  age  interval  (a,n  +  da)  is  h2%(a,x  +  a )  A  a  +  o(da ).  This 
assumption  defines  secondary  screens  as  following  a  non-homogeneous  Poisson  process 
in  age  with  intensity  A2S(a,x  +  a). 

•  The  probability  that  a  subject  bom  in  year  x,  with  the  disease  onset  at  the  age  of  y, 
screened  at  the  age  of  a  is  detected  with  cancer  is 


0,  y  >a 

y.(a  —  y),  otherwise 


(9) 


where  a(£)  is  the  sensitivity  of  screening,  and  c  is  the  age  of  tumour  at  the  time  of 
testing.  It  is  natural  to  specify  a(£)  as  an  increasing  function. 

It  should  be  noted  that  if  the  whole  screening  schedule  for  a  person  could  be  considered  a 
realization  of  a  non-homogeneous  Poisson  process,  we  would  expect  1iS  =  A2s-  This  reflects 
the  fact  that  the  time  to  any  next  event  in  such  process  is  characterized  by  the  hazard  function 
equal  to  the  intensity  of  the  process.  The  fact  that  Ais  ^  x2s  defies  the  description  of  the  whole 
screening  schedule  for  the  subject  as  a  non-homogeneous  Poisson  process.  In  particular,  given 
the  intensity  of  a  non-homogeneous  Poisson  process,  time  to  the  next  event  depends  only  on 
the  location  of  the  previous  event,  but  not  on  the  number  of  events  that  already  happened.  In 
our  case  it  matters  whether  it  is  a  first  or  secondary  test. 

Consider  the  probability  G2so^{z\x,a,y)  that  a  subject  born  in  year  x,  with  onset  of  the 
disease  at  the  age  of  y  who  has  had  his  first  screen  by  the  age  of  a  is  not  diagnosed  by 
screening  in  the  age  interval  [a,a  +  r],  a  >  y.  Note  that  this  is  a  probability  of  no  event  in  the 
interval  [0,  r]  for  a  non-homogeneous  Poisson  process  in  c  G  [0,  t]  with  intensity  A2s(«  +  i,x  + 
a  +  O  thinned  with  probability  5(c  +a  >’)  —  1  a(£  +  ci  —  y).  (We  use  the  notation  A  —  1  —  A 

for  any  A.)  The  intensity  of  a  Poisson  process  with  intensity  k  thinned  with  probability  a  is 
given  by  the  product  Aoc,  so  that  with  a  >y. 


G2SDx(r\x,a,y)=  expj-^  k2S(a  +  (,x  +  a  +  ()a((  +  a  -  y)dc|  (10) 

If  the  interval  in  question  is  before  onset,  a+z  ^  y,  then  there  is  no  diagnosis  and  G2si)x(r|x, 
y)—  1.  If  a<y  and  a  +  z>y,  the  time  interval  in  t  where  diagnosis  is  possible  starts  at  y  —  a, 
so  that  G2sDx(T|x,a,  y)  is  given  by  an  expression  similar  to  (10)  with  the  lower  limit  in  the 
integral  set  at  y  —  a.  Summarizing,  we  have 


G2Sdx(t|x,  a,  y)  —  exp 


where  J*  =  0  for  any  b  <  a. 


/  max(jt— a,0) 


k2s(a  +  C,x  +  a  +  Oa(C  +  a 


(11) 
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We  are  now  equipped  to  derive  the  probability  of  no  screening  diagnosis  by  the  age  of 
y  +  c,  G’si)X(  C|-r,  y)  conditional  on  year  of  birth  x  and  age  at  disease  onset  y,  where  {  is  time 
since  onset.  We  have 

Gsdx(£\x,  y )  =  Gis(y  +  £\x)  +  Gls(y\x)G2SDx(£\x,  y,  y) 

+  f  a(v)/is(y  +  v|v)G2Sdx(C  —  v\x,  y  +  v,  y)  dv  (12) 

Jo 

The  first  term  in  (12)  addresses  the  possibility  of  no  screens  by  the  age  of  y  +  c  The  second 
term  addresses  the  situation  when  the  first  screen  occurs  before  onset  of  the  disease  at  the 
age  of  y  and  no  diagnosis  is  achieved  through  secondary  screens  that  might  happen  in  the 
age  interval  (y,  y  +  y  ).  The  third  term  accumulates  the  probability  that  cancer  is  missed  at 
the  first  and  secondary  screens  occurring  after  disease  onset. 

2.3.  Composition  of  incident  cancers 

For  an  x-birth  cohort  consider  the  following  random  variables  that  model  various  potential 
(other  risks  removed)  durations  in  the  incidence  model: 

7,  age  at  onset  of  the  disease; 

Wdx,  time  from  onset  to  clinical  diagnosis; 
tsdx,  time  from  onset  to  screening  diagnosis; 
roc,  age  at  death  due  to  other  causes. 

Given  x,  zoc  is  assumed  to  be  independent  of  Y,  zCDx  and  tSdx-  Consider  the  event  of  cancer 
diagnosis  {Dx|Scr}  in  the  presence  of  screening.  We  may  write 

{Dx|Scr}  =  {Y  +  min(TCDx,  tsdx)  <t0c} 

Cases  diagnosed  in  the  presence  of  screening  are  composed  of  two  disjoint  groups, 

{Dx|Scr}  =  {CDx|Scr}  U  {SDx} 

where  {CDx|Scr}  is  the  event  of  clinical  diagnosis  through  symptoms, 

{CDx|Scr}  =  {7  +  TCDx<Toc}n{TSDx>TCDx} 
and  {SDx}  is  the  event  of  screening  diagnosis, 

{SDx}  =  {7  +  tsdx<toc}C{tcdx>tsdx} 

Screen-detected  cases  are  in  turn  composed  of  two  disjoint  groups, 

{SDx}  =  {RDx}  U  {ODx} 

where  {RDx}  is  the  event  of  diagnosis  of  a  relevant  cancer,  and  {ODx}  is  the  event  of 
overdiagnosis.  Relevant  cancer  is  a  case  diagnosed  at  a  screen  such  that  if  screening  results 
on  the  subject  were  ignored,  he  would  still  be  clinically  diagnosed  later  in  his  life  time, 

{RDx}  =  {TsDx<WDx}n{Toc>k  +  TcDx}  (13) 
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SDx 


Figure  2.  Composition  of  detected  cancers  {Dx|Scr};  {SDx},  screen-based  diagnosis;  {DxpScr},  cancer 
diagnosis  without  screening;  {CDx|Scr},  clinical  cancer  diagnosis  in  the  presence  of  screening  (interval 
cases);  {RDx},  screen-detected  cancer  that  would  be  detected  without  screening  (relevant  case);  {ODx}, 
screen-detected  cancer  that  would  not  be  detected  without  screening  (case  of  overdiagnosis). 


Overdiagnosis  is  the  event  of  screen  detection  such  that  the  subject  would  die  from  other 
causes  without  clinical  diagnosis,  if  the  results  of  screening  were  ignored 

{0Dx}  =  {rSDx<rCDx}n{7  +  rCDx  >r0c}  H  {7  +  rSDx<r0c} 

If  screening  were  ignored,  detected  cancers  would  be  represented  by  a  composition  of  two 
disjoint  groups,  relevant  cancers  and  interval  cancers  missed  by  screening, 

{DxpScr}  =  {RDx}  U  {CDx|Scr} 

The  structure  of  detected  cancers  described  above  (see  Figure  2)  can  be  verified  by  elementary 
sets  algebra. 


2. 4.  Overdiagnosis 

2.4.1.  Overdiagnosis  as  a  long-term  outcome.  Overdiagnosis  as  a  population  measure  is  de¬ 
fined  as  a  fraction  of  cancers  among  all  screen  detected  cancers  that  would  not  be  detected  in 
the  absence  of  screening.  Corresponding  to  this  population  statistic  is  the  following  conditional 
probability  obtained  using  the  results  of  Section  2.3: 


Pr{ODx|SDx} 


Pr{ODx} 

Pr{SDx} 


Pr{Dx|Scr}  -  Pr{DxpScr} 
Pr{SDx} 


(14) 


It  is  clear  that  the  above  measure  represents  excess  detection  due  to  introduction  of  screening 
relative  to  all  screen-detected  cases.  Alternatively,  overdiagnosis  and  excess  detection  can  be 
measured  relative  to  all  cancer  cases. 


Pr{ODx|{Dx|Scr}} 


Pr{ODx} 

Pr{Dx|Scr} 


(15) 


As  we  will  see  in  Section  5,  the  two  conditional  probabilities  (14)  and  (15)  show  different 
patterns  of  cohort  effects.  While  overdiagnosis  defined  as  (14)  is  decreasing  as  the  cohort 
life  span  enters  the  screening  era  and  screening  affects  the  oldest  individuals  in  the  cohort, 
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the  conditional  probability  (15)  shows  an  increasing  pattern.  The  crude  probability  of  cancer 
diagnosis  with  screening  in  place  {Dx|Scr}  for  the  x -birth  cohort  is  given  by 

pOO 

Pr{Dx|Scr,x}  =  /  Goc(a\x)f  i(a|x)dn  (16) 

Jo 

where  Goc(a\x )  is  the  survival  function  modelling  death  due  to  other  causes  for  age  a  and 
birth  year  x.  Given  the  hazard  function  Xoc (a,  t  )  by  age  and  calendar  time  available  for  the 
general  population  from  various  sources,  Goc  can  be  computed  using  an  expression  similar  to 
(8).  Likewise,  the  probability  of  cancer  diagnosis  in  the  absence  of  screening  can  be  written  as 

poo 

Pr{DxpScr,x}  =  /  Goc{a\x)fCDx(a\x)&a  (17) 

Jo 


where 

fcDx  —  /  fo(y\x)fcDx(a  -  y\x,y)dy 
Jo 


is  the  net  p.d.f.  of  age  at  clinical  diagnosis  Y  +  tcdx-  The  denominator  of  (14)  is 

poo 

Pr{SDx|Scr}=  /  Goc(a|x)/|Dx(a|x)d(7 
Jo 

where  /|Dx  is  the  crude  density  given  by  (6).  Note  that  overdiagnosis  and  lead  time  (see 
next  section)  considered  as  a  long-term  outcome  are  a  matter  of  prediction  as  well  as  esti¬ 
mation.  Their  evaluation  involves  projecting  population  history  beyond  the  observed  box  for 
the  duration  of  a  lifetime.  For  example,  a  man  turning  50  in  year  2000  may  be  diagnosed  by 
screening  in  year  2004.  Whether  this  man  is  a  case  of  overdiagnosis  depends  on  the  other 
cause  of  death  risk,  trends  in  disease  development  and  other  relevant  population  dynamics 
during  his  potential  future  lifetime  up  to  the  year  of,  say,  2050.  Various  other  definitions 
could  be  proposed  to  make  these  characteristics  less  confounded  by  future  scenarios. 


2.4.2.  Age-specific  presentation.  In  this  subsection  we  consider  a  measure  of  overdiagnosis 
conditional  on  age  and  calendar  year  at  detection.  Age-specific  conditional  p.d.f.  of  overdiag¬ 
nosis  is  proportional  to  the  crude  p.d.f.  of  age  at  the  event  of  overdiagnosis 

pa  poo 

f(a, ODx|x)=/  f0(y\x)fsDx(a-y\x,y)  /  foc{a  +  s\x)GCDx{a  -  y  +  s\x,y)Ay  As  (18) 

the  quantity  f(a,  ODx  |x)  da  representing  a  fraction  of  the  x-cohort  overdiagnosed  at  the  age 
of  a  to  a  +  da.  Here  and  in  the  sequel  we  omit  the  subscript  to  /  when  it  is  clear  from 
its  arguments  what  it  relates  to.  Expression  (18)  represents  a  formula  of  total  probability 
averaging  over  the  following  sequence  of  events,  in  order  of  the  product  in  (18),  onset  at  the 
age  of  y,  screening  diagnosis  at  the  age  of  a,  and  death  from  other  causes  s  years  later  prior 
to  clinical  diagnosis  of  prostate  cancer.  A  similar  crude  p.d.f.  for  screen-based  diagnosis  and 
any  diagnosis  is  given  by 


f(a,  SDx|x)  =  G0C(a|x)/scDx(a|x) 
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and 


fDx(a\x)=  Goc(a\x)fi(a\x) 

respectively.  Finally,  age-specific  probabilities  of  overdiagnosis  in  screen-detected  cases  and 
in  all  prostate  cancer  cases  are  given  by 


/sDx(a,  SDx  x) 

(19) 

Pr{ODx  x,  0,  Dx}  =  toDx(a.ODx\x) 
fox(a\x) 

(20) 

respectively.  Shown  in  Figure  7  (bottom)  are  the  two  age-specific  measures  of  overdiagnosis 
as  estimated  from  SEER  data. 

2.5.  Lead  time 

2.5.1.  Lead  time  as  long-term  outcome.  Lead  time  is  the  time  by  which  diagnosis  of  cancer 
is  advanced  due  to  screening  in  patients  who  would  be  detected  anyway  if  screening  were 
not  applied.  Thus,  lead  time  is  defined  in  the  group  of  patients  detected  with  relevant  cancer 
{RDx}  (see  (13)).  With  this  definition,  we  avoid  the  ambiguity  of  the  lead  time  in  over¬ 
diagnosed  cases.  The  group  of  patients  with  relevant  cancer  is  characterized  by  the  following 
history  of  the  disease: 

•  The  subject  is  bom  in  year  x. 

•  Tumour  onset  occurs  at  the  age  of  y.  It  is  not  interrupted  by  death  due  to  other  causes. 

•  Screen-based  detection  of  the  tumour  occurs  at  the  age  of  y  +  Csnx-  This  event  is  not 
interrupted  by  death  due  to  other  causes. 

•  If  screening  diagnosis  were  ignored,  the  tumour  would  surface  at  the  age  of  y  +  Ccdx 
(clinical  diagnosis),  Caix  >  Csox-  This  event  would  not  be  interrupted  by  death  due  to 
other  causes. 

The  lead  time  is  the  random  variable  Ccdx— £sdx-  Its  existence  is  conditional  on  the  membership 
in  the  {RDx}  group.  Formally,  the  probability  of  {RDx}  for  the  x-birth  cohort  is  given  by 

pOO  pOO 

Pr{RDx|x}  =  /  f0(y\x)  fcD,a\x,y)Goc(y+^\x)GSDx^\x,y)dydc  (21) 

Jo  Jo 

where  Gsdx(£|*,  y)=  1  —  Gsdx(£\x, y)  is  the  net  probability  of  screening  diagnosis  in  {  years 
after  tumour  onset.  The  product  of  probabilities  in  (21)  directly  follows  the  definition  of 
relevant  cancer  (13).  We  can  now  write  the  mean  lead  time  in  x-cohort  as  the  conditional 
expectation 


E{£, CDx  -  £sdx|RDx,x}  = 


p  oo 

fo(y\x)  /  fcD^\x,y)Goc{y  +  £\x) 

Jo 

X  /VsDx(C|*,T)(£-OdydcdC  (22) 

Jo 


Pr{RDx \x}  Jo 

■t 
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Clearly,  it  is  reasonable  to  limit  the  time  span  in  the  improper  integrals  by  the  maximal  human 
lifetime. 


2.5.2.  Age-specific  presentation.  An  age-specific  version  of  the  lead-time  can  be  defined. 
Consider  a  conditional  mean  lead  time,  given  relevant  diagnosis  at  the  age  of  a.  The  crude 
joint  p.d.f.  of  age  (a)  and  lead  time  ( s )  at  relevant  diagnosis  (RDx)  is  represented  as 

/LT(s,a,RDx|x)  =  Goc(a  +  s\x)  /  fo(y\x)fSDx(a-y\x,y)fCDx(a-y  +  s)dy  (23) 

Jo 

where  the  quantity  f  \;\{s,a,  RDx|x)d.vd<7  represents  a  fraction  of  the  x -cohort  having  relevant 
diagnosis  at  the  age  of  a  to  a  +  dn  and  lead  time  .v  to  .y  +  d.v.  The  crude  age  distribution  at 
relevant  diagnosis  is  obtained  by  integrating  the  lead  time  out  of  (23), 


/(a,RDx|x)  =  /  fcris,  a,  RDx|x)d.v 


Now,  the  mean  conditional  lead  time  given  age  at  diagnosis  (a)  and  year  of  diagnosis  (x  +  a ) 
is  given  by 


^{Ccdx  -  £sdx|RDx,x,  a} 


f°°  fLT(s,a,RDx\x) 

Jo  /(a,RDx|x)  5  5 


(24) 


2.5.3.  Potential  lead  time.  Lead  time  defined  in  the  previous  sections  is  conditional  on  the 
event  of  screen-detection  and  on  the  fact  that  this  is  a  relevant  diagnosis  (the  potential  point 
of  clinical  diagnosis  at  the  end  of  sojourn  time  occurs  before  other  causes  interrupt  the  natural 
history).  In  order  to  make  it  a  good  surrogate  of  screening  dissemination,  we  now  broaden  the 
definition  to  all  cancer  cases  and  consider  it  in  the  absence  of  other  causes.  The  absence  of 
other  causes  makes  the  concept  of  relevant  diagnosis  obsolete.  The  lead  time  for  an  interval 
(non-screening)  diagnosis  is  defined  as  zero.  Modifying  the  argument  of  the  previous  section, 
we  have  the  following  expression  for  the  age-specific  potential  lead  time: 

e  ,  ,  fLj(s,a\x) 

£{£cdx  -  £sDx|x,a}=  J  f^alx)  ^  (25) 


where 


fLj(s,a\x)=  /  fo(y\x)fCDx(a- y  +  s\x,y) 
Jo 


GSDx(a  -  y\x,  y),  s  =  0j 

\  d  y 

/sDx(a  -  y\x, y),  S>0J 


3.  LIKELIHOOD 

Observed  data  for  the  incidence  model  is  represented  by  the  following  quantities  available  by 
age  a  and  calendar  year  t  in  a  certain  box: 

•  Population  count  P(a,  t)  of  people  at  risk  of  cancer  development. 

•  Count  of  cancer  cases  detected  in  year  t  at  the  age  of  a,  C{a,t). 
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The  conditional  likelihood  of  the  data  is  built  as  a  product  of  conditional  probabilities  of 
cancer  detection  given  the  subject  is  in  the  risk  set  for  each  a,t  combination  from  the  box. 

L  =  n[l  -  h(a,  (M)-C(a,0Ai(a;  ^C(a,t)  (26) 

a,t 

In  the  above  expression  we  omit  At  =  1  year  from  the  product  X  At.  Taking  the  log,  using  the 
fact  that  k\  is  small,  and  dropping  terms  that  do  not  depend  on  the  model  parameters,  we 
obtain 


£  =  C(a,  t )  log  Xi(a,  t )  —  P(a,  t)li(a,  t )  (27) 

a,t 

Note  that  the  same  likelihood  would  result  if  we  assumed  that  C  is  Poisson  distributed  with 
expectation  PX\  and  that  C(a,t)  represent  independent  random  variables  for  different  (a,t) 
pairs  (which  is  not  the  case  in  (27)).  Maximum  likelihood  inference  is  used  to  obtain  point 
estimates  and  confidence  intervals  for  the  model  parameters  entering  X\.  Maximization  of  the 
likelihood  can  be  regarded  as  minimizing  a  certain  distance  between  the  empirical  incidence 
C/P  and  its  model-based  counterpart  X\. 


4.  SPECIFYING  THE  MODEL 

Since  incidence  of  prostate  cancer  before  the  age  of  50  is  negligibly  small,  we  will  associate 
the  birth  year  x  with  the  year  in  which  the  man  turns  50.  Age  variables  will  be  counted  out 
accordingly  from  this  point. 


4.1.  Age  at  tumour  onset 

We  use  three  parametric  distribution  families  in  our  analysis:  Gamma  distribution,  Weibull 
distribution  and  the  so-called  Moolgavkar,  Venzon,  Knudson  (MVK)  distribution  [16-18]  for 
the  baseline  age  at  tumour  onset.  The  Weibull  baseline  hazard  function  is  given  by 


ho  (y)  =  s0 


r(l  +  l /so) 
Ho 


,-i 


where  y  is  the  age  past  50.  In  the  above  expression  Weibull  distribution  is  parameterized 
through  the  mean  //q  and  the  shape  parameter  .vo  related  to  the  coefficient  of  variation 


r(i  +  (2Ao)) 

r*(l +(!/*>)) 


With  the  Gamma  distribution,  we  have 

s°  e-r*%° 

prcso) 

Gamma  and  Weibull  distributions  are  the  two  convenience  choices. 


fo(y)  = 


ys  o 
Ho 
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The  MVK  distribution  [16-18] 

e(A+B)y  _  i 

hM  =  pAB— 

where  A,B  and  p  are  identifiable  parameters  represents  a  two-stage  mechanistic  model  of 
carcinogenesis.  The  MVK  distribution  has  an  extra  degree  of  freedom  compared  to  the  other 
two  distributions.  Weibull  distribution  showed  the  best  Akaike  information  criterion  (AIC)  in 
the  sensitivity  analysis. 

Included  in  the  model  is  a  trend  function  To(t)  that  depends  on  calendar  time.  This  function 
exerts  a  multiplicative  effect  on  the  baseline  hazard  so  that  the  hazard  of  tumour  onset  depends 
on  age  and  birth  cohort 

^o(y\x)  =  ho(y)T0(x  +  y ) 

The  trend  is  used  to  model  possible  changes  in  the  pattern  of  the  disease  onset  with  calendar 
time  due  to  unspecified  factors  such  as  changes  in  diet,  environment  and  biology  of  the  disease. 
Note  that  it  is  hardly  possible  to  give  a  biological  definition  for  the  tumour  onset.  From  the 
modelling  prospective,  tumour  onset  represents  the  earliest  point  in  time  where  cancer  could 
be  detected  by  screening.  For  this  reason  changes  in  detection  technology,  practice  of  biopsies 
for  the  disease  following  a  positive  screens  and  other  diagnostics  management  issues  may  also 
affect  the  definition.  Changes  in  such  practices  that  are  not  modelled  in  a  mechanistic  fashion 
are  thought  of  as  part  of  the  trend  function.  We  used  truncated  linear  trend  functions  in  data 
analysis. 

4.2.  Sojourn  time  distribution 

Sojourn  time  represents  the  potential  (other  risks  removed)  time  from  tumour  onset  to  its 
clinical  diagnosis.  Weibull  distribution  with  mean  pc Dx  and  shape  parameter  .vCDx  is  used  to 
model  the  baseline  sojourn  time  hazard.  Two  effects  can  be  imposed  on  the  baseline  sojourn 
time  distribution: 

•  Age  dependence.  Sojourn  time  may  be  modulated  by  age  for  various  reasons.  Tumour 
growth  biology  may  depend  on  the  age  of  the  person.  Also,  tumours  developing  at  a 
younger  age  may  represent  a  special  subtype  that  can  have  different  progression  char¬ 
acteristics.  To  model  age  dependency,  the  mean  sojourn  time  is  regressed  on  the  age 
at  tumour  onset  y  as  pcox  cxp(— /CoxT),  where  the  parameter  /fc dx  models  correlation 
between  the  sojourn  time  and  the  onset  time. 

•  Secular  trend.  Sojourn  time  may  be  modulated  by  changes  in  the  practice  of  cancer 
detection  other  than  the  studied  modality  of  screening.  Most  notably,  before  PSA  was 
introduced,  prostate  cancer  was  often  detected  as  a  result  of  surgery  (transurethral  resec¬ 
tion  of  the  prostate,  TURP)  for  benign  prostate  disorders  [19].  Other  changes  in  prostate 
cancer  awareness  in  the  population  and  detection  practices  may  have  contributed  to  a 
trend  of  increasing  incidence  observed  before  PSA  was  introduced.  These  trends  in  calen¬ 
dar  time  are  modelled  using  a  multiplicative  trend  function  7cdx(0  acting  on  the  baseline 
sojourn  time  hazard. 

We  have  the  sojourn  time  hazard  in  the  fonn 

Acd*(£\x,  y)  =  hcDx(£\y)TCDx(x  +  y  +  £)  (28) 
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First  PSA  Test  Secondary  PSA  Test 


Figure  3.  Risks  of  first  2]S  and  secondary  /.2s  PSA  tests  as  estimated  from  the  simulation  model  by  age 
and  calendar  year.  Left:  proportion  of  never  screened  men  at  risk  getting  their  first  PSA  test.  Right: 
proportion  of  men  screened  at  least  once  getting  a  secondary  PSA  test. 


where  x  is  the  birth  year,  y  is  age  (past  50)  at  tumour  onset,  C  is  time  since  tumour  onset, 
and  hCm(£\y)  is  Weibull  hazard  with  shape  parameter  sCdx  and  mean  /jcdx  exp(-/lci)Xy). 

4.3.  PSA  screening  model 

National  Cancer  Institute’s  Statistical  Research  and  Applications  Branch  has  developed  a  sim¬ 
ulator  for  PSA  schedules  for  arbitrary  birth  cohorts  in  the  1916-2000  box.  This  simulator  uses 
data  from  the  National  Health  Interview  Survey  (NHIS)  [20]  and  Surveillance,  Epidemiology 
and  End  Results  (SEER) — Medicare  linked  database  [21],  To  extrapolate  the  data  beyond  the 
original  age-year  box,  generalized  additive  models  (R  procedure  gam)  were  used  to  smooth 
the  data.  A  logistic  regression  model  was  used  for  smoothing  with  the  additive  main  effects 
of  age  a  and  calendar  year  t  represented  by  thin  plate  regression  splines  [22].  No  interaction 
smooth  terms  were  specified.  Shown  in  Figure  3  is  an  estimate  for  the  risks  of  first  2is(a,  t) 
and  secondary  2,2s(frO  PSA  tests.  It  is  clear  from  the  figure  that  the  risk  of  secondary  PSA 
test  is  several  times  higher  the  one  for  the  first  test.  This  observation  prompted  the  develop¬ 
ment  of  the  two-stage  model  for  screening-based  detection  described  in  Section  2.2.  Frequency 
of  PSA  testing  by  age  increases  initially  as  the  man  enters  the  risk  zone  for  prostate  can¬ 
cer.  However  for  the  older  ages  a  decreasing  pattern  is  observed  perhaps  because  of  limited 
residual  life  expectancy  and  associated  diminishing  relevance  of  detection  of  prostate  cancer. 
Dissemination  by  calendar  year  is  different  for  the  first  and  secondary  tests.  In  men  who  have 
been  screened  at  least  once  the  frequency  increases  as  PSA  is  introduced  into  practice  and  the 
surface  settles  at  stable  values  in  the  1990s.  The  risk  of  getting  the  first  test  by  calendar  year 
shows  a  spike  in  early  1990s  and  settles  at  a  lower  level  later  showing  a  decreasing  pattern  in 
the  late  1990s.  This  phenomenon  deserves  further  study.  The  effect  could  be  a  consequence  of 
heterogeneity  in  people’s  acceptance  of  PSA  testing.  The  group  of  men  showing  compliance 
for  PSA  testing  is  dissipating  with  time  as  such  men  get  tested  and  leave  the  set  of  men  ‘at 
risk’  for  the  first  test.  Another  explanation  might  be  that  the  recent  decline  in  the  frequency 
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of  new  PSA  tests  is  associated  with  a  dissemination  of  knowledge  of  various  controversial 
issues  surrounding  screening  for  and  treatment  of  prostate  cancer. 


5.  DATA  ANALYSIS 

SEER  database  was  used  to  obtain  data  on  more  than  350000  cases  of  prostate  cancer  diag¬ 
nosed  in  nine  areas  of  the  U.S.  (San  Francisco-Oakland,  Connecticut,  Detroit,  Hawaii,  Iowa, 
New  Mexico,  Seattle,  Utah,  Atlanta)  as  well  as  population  count  files  corresponding  to  those 
cases.  We  use  the  modelling  box  corresponding  to  age  interval  [50,85]  and  calendar  year 
interval  [1973-2000].  Age  distribution  in  the  U.S.  population  in  year  2000  for  men  over  50 
is  used  as  a  standard  when  age-adjusted  characteristics  are  reported.  Risk  of  death  from  other 
causes  was  derived  from  the  Human  Mortality  Database  [23]. 

As  shown  in  Figure  1 ,  incidence  of  prostate  cancer  before  the  introduction  of  PSA  showed 
an  increasing  trend  in  calendar  time.  This  is  reportedly  due  to  TURP  [19],  a  surgical  treatment 
for  a  benign  enlargement  of  the  prostate.  Incidentally,  many  early  stage  prostate  cancers  were 
discovered  as  a  result  of  TURPS.  With  the  introduction  of  PSA  modelled  from  the  year  of 
1 987,  TURP  rates  rapidly  declined  as  treatment  for  benign  disease  in  the  prostate  was  replaced 
by  non-surgical  alternatives  [19].  In  order  to  model  this  effect,  a  linear  trend  was  specified 
for  the  sojourn  time  model  (28)  for  the  period  1973-1987,  saturating  in  1988 

!1,  t<  1973 

\+c(t  —  1973),  1973  ^  f  ^  1988 

1  +c(1988  -  1973),  t>  1988 

The  parameter  c  specifies  the  slope  of  the  trend.  As  we  are  mainly  interested  in  the  population 
effects  of  PSA  screening,  the  model  parameters  responsible  for  pre-PSA  trends  are  considered 
nuisance  parameters.  Yet  modelling  and  joint  estimation  of  the  pre-PSA  era  parameters  is  im¬ 
portant  as  it  represents  a  reference  baseline  point  for  the  relative  effects  of  PSA.  In  specifying 
PSA  effects  we  were  looking  for  a  simple  trend  function  that  would  allow  us  to  provide  an 
adequate  description  of  cancer  incidence  jointly  for  the  pre-  and  post-PSA  era. 

We  believed  that  changes  in  the  onset  time  distribution  over  a  relatively  short  observation 
window  are  unlikely.  Such  changes  would  mean  that  the  prostate  cancer  has  become  a  different 
disease  over  the  end  of  the  20th  century.  Without  compelling  evidence  we  were  hesitant  to 
include  such  changes  into  the  model,  particularly  since  onset  time  is  unobservable.  The  model 
with  onset  trend  alone  showed  a  much  worse  AIC  than  the  model  where  incidence  trend  was 
addressed  through  sojourn  time  (AIC  4538  versus  510).  Although  introduction  of  both  trends 
showed  the  best  AIC  of  281,  the  estimated  slope  of  the  onset  time  trend  in  this  combined 
model  was  by  two  orders  of  magnitude  smaller  than  that  of  the  sojourn  time  trend  (0.1  versus 
0.004),  which  made  it  too  small  to  be  interpreted.  Also,  the  quality  of  registry  data  and  coding 
practices  have  been  improving,  particularly  in  the  period  from  1973  to  1988  where  incidence 
trend  was  observed.  Based  on  these  considerations  we  decided  to  proceed  without  T0(t)  in 
the  model. 

The  model  assigned  negative  correlation  between  age  at  onset  and  the  sojourn  time  ( />cnx  = 
0.274,  Cl:  (0.272,0.279)).  This  effect  is  attributable  to  underestimation  of  cancer  incidence 
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1975  1985  1995  2005 


Year 


Figure  4.  Left:  time  to  tumour  onset  survival  function  as  estimated  using  Gamma, 
Weibull  and  MVK  families  of  parametric  distributions.  Time  count  starts  at  the  age 
of  50.  Right:  predictions  of  prostate  cancer  age-adjusted  (U.S.  males  in  year  2000) 
incidence  under  various  scenarios  of  PSA  dissemination. 


for  men  over  70  in  the  early  1970s  (Figure  5).  Negative  correlation  is  allowing  for  late  onset 
and  early  detection  thus  increasing  cancer  incidence  in  the  elderly  in  the  absence  of  PSA. 
We  believe  this  is  an  artefact  induced  by  assumed  flat  secular  trends  before  1973.  Since  no 
data  before  1973  was  available,  our  ability  to  address  the  issue  was  limited,  and  we  forced 
/I cdx  —  0  in  the  final  fit.  It  should  be  noted  that  this  left  the  other  model  parameters  and  the 
shape  of  the  incidence  surface  past  1980  practically  unaltered. 

PSA  sensitivity  as  a  function  of  the  age  of  tumour  £  (time  since  tumour  onset)  was  specified 
as  the  following  increasing  function: 

a(£)  =  1  —  exp(— b£),  b>  0 

However,  when  fitting  the  model,  the  estimate  settled  at  an  100  per  cent  sensitive  PSA 
test.  The  profile  likelihood  of  b  is  an  increasing  function  (not  shown).  Therefore,  a  model 
with  100  per  cent  PSA  sensitivity  was  used.  With  respect  to  the  parameter  b,  maximum 
likelihood  estimate  occurred  at  the  border  of  the  parametric  space.  It  is  well  known  that  in 
such  cases  likelihood  ratio  statistic  does  not  generally  follow  the  chi-squared  distribution. 
Basing  model  selection  on  the  AIC  criterion,  we  used  the  general  rule  of  thumb  that  AIC 
needs  to  change  by  more  than  2  in  order  that  models  be  considered  as  different.  This  criterion 
puts  the  uncertainty  in  the  b  parameter  at  the  AlC-confidence  interval  of  [4,  oo).  This  means 
that  the  model  indicates  that  PSA  sensitivity  reaches  half  of  its  maximal  value  in  2  months 
after  onset  or  sooner. 

It  should  be  stressed  that  the  notion  of  onset  time  is  a  mathematical  construct  that  is  difficult 
to  identify  with  specific  in  vivo  biological  processes  leading  to  cancer,  particularly  since  it 
cannot  be  observed.  It  might  be  the  case  that  population  data  provide  limited  information  to 
make  reliable  inference  about  this  unobservable  process.  With  these  considerations  in  mind 
we  conducted  sensitivity  analyses  with  respect  to  the  onset  time  distribution.  It  turned  out 
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Table  I.  Estimates  of  model  parameters  and  confidence  intervals. 


Parameter 

Legend 

Point  estimate 

95%  Cl 

JtCDx 

Mean  baseline  sojourn  time 

18.558 

(18.345,18.775) 

•SCDx 

Shape  sojourn  time 

1.541 

(1.5191,1.5644) 

c 

Slope  of  trend  for  sojourn  time 

0.09354 

(0.09068,0.09641) 

Ho 

Mean  age  past  50  at  tumour  onset 

72.732 

(72.498,72.965) 

so 

Shape  of  age  past  50  at  tumour  onset 

1.6153 

(1.6067,1.6239) 

Time  and  age  is  measured  in  years. 


Observed  Incidence 


Expected  Incidence 


Figure  5.  Prostate  cancer  incidence.  Observed  (left):  empirical  estimate  of  prostate 
cancer  incidence  C(a,t)/P(a,t).  Expected  (right):  model-predicted  prostate  cancer 
incidence  X\(a,t)  by  age  and  calendar  year. 


that  Weibull  distribution  showed  the  best  AIC  (510  for  Weibull  versus  more  than  1000  for 
Gamma  and  MVK).  However,  we  found  that  all  distribution  choices  provided  a  very  similar 
estimate  for  the  age  at  onset  survival  function  (Figure  4).  As  evident  from  the  figure,  the 
shape  of  the  onset  time  distribution  is  fairly  unspectacular,  and  not  much  flexibility  is  needed 
to  reproduce  the  pattern. 

Likelihood  was  maximized  by  the  Powell’s  method  [24]  of  conjugate  directions.  Confidence 
intervals  for  the  model  parameters  are  based  on  likelihood  ratio  and  inverting  of  the  profile 
likelihood  surface  for  each  parameter.  Estimates  of  model  parameters  and  the  corresponding 
confidence  intervals  are  shown  in  Table  I. 

Both  sojourn  time  and  onset  time  are  potential  times  defined  as  if  all  other  risks  were  re¬ 
moved.  Note  that  the  age  at  tumour  onset  goes  well  beyond  the  normal  human  lifetime.  This 
is  a  consequence  of  the  fact  that  only  a  proportion  of  men  would  ever  develop  prostate  can¬ 
cer  in  their  life  span.  Shown  in  Figure  5  is  a  histogram  empirical  estimate  of  prostate  cancer 
incidence  C(cz,  t)/P(a,  t )  and  its  model-predicted  counterpart  A[(a,  t)  by  age  and  calendar  year. 
The  model  captures  the  basic  pattern  of  prostate  cancer  incidence.  The  spike  effect  in  the  inci¬ 
dence  occurring  with  the  introduction  of  PSA  gets  more  pronounced  with  age  except  for  very 
old  people.  This  is  a  consequence  of  latent  prevalence  of  the  disease  accumulating  with  age. 
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Figure  6.  Overdiagnosis  (left)  and  lead-time  (right)  by  birth  cohort.  Dashed  line  is  the 
fraction  of  overdiagnosis  in  screen-detected  patients.  Solid  line  (left)  is  the  fraction  of 

overdiagnosis  in  all  cancer  patients. 


Shown  in  Figure  6  is  an  estimate  of  lead  time  and  overdiagnosis  by  birth  cohort.  Both 
notions  formalized  in  Sections  2.4  and  2.5  relate  to  the  potential  natural  history  of  the  disease 
and  population  screening  exposure  over  the  life  span  of  an  individual.  As  we  move  the  year 
of  birth  to  the  right,  more  and  more  of  the  cohort  life  span  falls  on  the  PSA  dissemination  era. 
This  leads  to  an  increasing  pattern  of  lead  time  and  overdiagnosis  among  all  detected  cancer 
patients  (solid  curves).  For  men  entering  the  age  risk  zone  for  prostate  cancer  at  the  present 
time,  the  model  predicts  about  6-year  mean  lead  time  and  25  per  cent  overdiagnosis  among  all 
detected  patients.  Interestingly,  overdiagnosis  in  screen-detected  cases  is  a  decreasing  function 
of  the  birth  year  and  settles  at  about  30  per  cent  for  the  present  era.  Initially  for  a  person 
bom  in  the  1950s  only  older  ages  are  affected  by  PSA  dissemination.  If  detected  at  such  an 
age,  the  case  is  very  likely  to  be  overdiagnosed.  Indeed,  if  screening  were  ignored  the  disease 
would  have  little  chance  to  surface  because  of  the  very  small  expected  residual  lifetime  in 
older  people.  This  is  why  the  dashed  curve  in  Figure  6  (left)  starts  high.  As  we  move  the 
potential  life  history  more  and  more  under  the  PSA  exposure,  the  pool  of  screen-detected 
cases  gets  enriched  with  relevant  cancers  that  have  advanced  diagnosis  due  to  PSA  yet  would 
surface  clinically  in  their  potential  residual  lifetime  if  PSA  were  not  applied.  Also,  since  (14) 
and  (15)  have  the  same  numerator  and  in  the  denominator  screen-detected  cases  represent  a 
subset  of  all  cancer  cases,  overdiagnosis  relative  to  screen-detected  cases  (the  dashed  curve)  is 
always  higher  than  the  one  relative  to  all  cancer  cases  (the  solid  curve).  Shown  in  Figure  7  are 
the  age-specific  estimates.  These  estimates  are  conditional  on  cancer  diagnosis  at  the  specified 
age.  In  screen-detected  cases  (left  part  of  the  figure),  lead  time  and  overdiagnosis  show  a 
fairly  stationary  age  distribution  by  calendar  year.  Bivariate  distributions  in  all  detected  cases 
(right  part  of  the  figure)  follow  PSA  dissemination  pattern  showing  increasing  lead  time  and 
overdiagnosis  with  the  introduction  of  PSA. 
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Figure  7.  Age-specific  fractions  of  overdiagnosis  (bottom)  and  lead-time  (top).  Top  left  shows  lead  time 
defined  using  the  competing  risk  of  death  due  to  other  causes  (22),  while  top  right  figure  gives  mean 
potential  lead  time  in  the  absence  of  other  causes  (25).  Bottom  left  shows  the  fraction  of  overdiagnosis 
in  screen-detected  cancer  patients  (19),  and  the  bottom  right  figure  refers  to  all  cancer  patients  (20). 


Finally,  we  evaluate  two  hypothetical  scenarios  of  PSA  testing:  no  PSA  screening  versus 
yearly  PSA  tests  for  every  men  over  50  starting  in  1988.  Shown  in  Figure  4,  right,  are 
age -adjusted  predictions  for  the  two  scenarios  as  well  as  the  prediction  for  actual  PSA  dis¬ 
semination  as  estimated  by  the  simulator.  This  figure  gives  a  range  for  possible  incidence 
dynamics  dependent  on  PSA  dissemination  patterns. 


6.  DISCUSSION 

In  this  paper  we  have  presented  a  population  model  of  prostate  cancer  natural  history  and 
screening.  The  model  was  used  to  capture  the  spike  of  prostate  cancer  incidence  registered 
after  introduction  and  dissemination  of  PSA  screening  for  the  disease.  The  link  between 
screening  dissemination  in  the  population,  natural  history  of  the  disease  and  cancer  registration 
statistics  allowed  us  to  predict  lead  time  and  overdiagnosis  of  prostate  cancer.  The  latter  two 
characteristics  were  rigorously  defined  in  the  population  setting  where  screening  schedules  are 
random  and  unknown. 
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An  external  data  source,  NHIS,  was  used  to  inform  the  model  about  recent  frequencies 
of  PSA  testing.  In  a  large  scale  modelling  effort,  relying  on  a  variety  of  data  sources  is 
unavoidable.  However,  this  may  create  problems.  In  particular,  there  might  be  differences 
in  PSA  dissemination  between  SEER  population  (approx  11  per  cent  of  total  U.S.)  and  the 
NHIS  national  base.  Available  data  on  PSA  testing  do  not  allow  us  to  discriminate  between 
a  diagnostic  and  a  screening  PSA  test.  Diagnostic  tests  are  usually  prompted  by  symptoms 
of  an  enlarged  prostate.  Prostate  enlargement  can  be  caused  by  locally  advanced  prostate 
cancer,  or,  more  likely  by  a  benign  disease.  Thus  cancer  discovered  as  a  result  of  PSA 
may  be  a  truly  screening  diagnosis,  clinical  diagnosis  or  incidental  diagnosis.  To  prevent 
misattribution  of  PSA  dissemination,  tests  performed  within  3  months  of  diagnosis  were  con¬ 
sidered  as  diagnostic.  Use  of  external  data  sources  introduces  additional  variability  of  the 
estimates  that  is  difficult  to  control.  Bayesian  methods  might  be  preferable  if  this  variability  is 
substantial. 

Available  data  do  not  provide  explicit  information  on  unobservable  processes  such  as  tu¬ 
mour  onset  and  sensitivity  of  PSA  test  prior  to  diagnosis.  Tumour  onset  can  never  be  ob¬ 
served,  and  SEER  data  do  not  have  information  on  how  many  men  were  tested  in  each 
particular  (age, year)  cell.  This  makes  the  model  estimates  of  the  sensitivity  curve  and  the 
onset  time  distribution  difficult  to  verify  by  the  data.  Sensitivity  is  likely  to  be  technol¬ 
ogy  dependent  which  might  introduce  a  trend  in  the  sensitivity  parameters  in  calendar 
time. 

A  number  of  model  applications  remained  beyond  the  scope  of  the  present  paper.  The 
model  can  be  used  to  adjust  estimates  of  survival  after  treatment  for  the  lead-time  and  over¬ 
diagnosis.  Regression  analysis  of  prostate  cancer  survival  is  confounded  by  the  lead  time  and 
overdiagnosis.  People  diagnosed  under  less  intensive  screening  generally  have  shorter  lead 
times  and  shorter  apparent  survival  times.  Also,  they  are  less  likely  to  be  over-diagnosed 
and  appear  ‘cured’  when  followed  up  for  post-treatment  failure.  In  order  to  adjust  estimated 
treatment  effects  for  such  confounding  conditional  history  of  the  disease  given  presentation  at 
diagnosis  can  be  considered  as  frailty  when  analysing  survival  data.  The  population  incidence 
model  presented  above  can  be  used  to  derive  the  frailty  distribution  as  it  changes  with  year 
of  diagnosis,  age  and  clinical  covariates.  Such  model  development  may  help  reduce  the  biases 
inherent  in  evaluation  of  treatment  effects  using  non-randomized  tumour  registry  data. 

Jointly,  cancer  incidence  and  survival  models  can  be  used  to  build  a  model  of  mortality  that 
has  a  link  to  PSA  dissemination  parameters  and  the  effects  of  treatment  and  screening.  This 
approach  appears  promising  as  a  tool  to  address  the  reasons  for  recent  declines  in  prostate 
cancer  mortality. 

Analysis  of  prostate  cancer  incidence  by  race  indicates  that  race  might  be  a  important 
variable  modulating  the  natural  history  of  the  disease.  Reliable  estimates  are  needed  for 
PSA  dissemination  by  race  in  order  to  address  a  possible  causal  effect  of  race  in  prostate 
cancer. 
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